1.0 INTRODUCTION


Statistics is an area of study that deals with collecting, organising,
analysing, interpreting and presenting data. It has many applications in
industry, agriculture, medicine and other fields. In this unit, you will
learn about the importance and scope of statistics. You will also learn
about the types of data, the ways of collecting them and how to present
them.


1.1 OBJECTIVES 


After going through this unit, you will


• Understand the meaning, importance and functions of statistics
• Learn the different types of data and how to collect them
• Know how the collected data can be presented


1.2 STATISTICS


The English word 'statistics' is derived from the Latin word 'status',
the Italian word 'statista' or the German word 'statistik'. In each case
it means "an organised political state". In the past, statistics was
regarded as the "science of statecraft", as it was used by the
governments of various states to collect data regarding population,
births, deaths, taxes, etc. Nowadays statistics has a much wider scope.
Statistics plays a crucial role in enriching a specific domain by
collecting data in that field, analysing the data with various
statistical techniques and making inferences from them. For example,
knowing the average height of the students will help an engineer to
decide the size of a door.


1.2.1 DEFINITION OF STATISTICS 


Statistics can be defined in two ways, covering two different
concepts. They are


1. Statistics as numerical data
2. Statistics as statistical methods


1. Statistics as numerical data 


When the word 'statistics' is used in the plural sense, it refers to a
collection of numerical data.


For example: export or import quantities, foreign direct investment,
etc.


According to Webster, "Statistics are classified facts representing
the conditions of the people in a state, especially those facts which can
be stated in numbers or in tables of numbers or in any tabular or
classified arrangement."


Webster's definition implies that only numerical facts can be
termed statistics. This is an old and narrow definition, inadequate for
modern times.


According to Bowley, "Statistics are numerical statements of facts in
any department of inquiry placed in relation to each other."


Here, Bowley treats statistics as the science of counting and
ignores other aspects such as analysis and interpretation.


According to Yule and Kendall, "By statistics we mean
quantitative data affected to a marked extent by multiplicity of causes."


Yule and Kendall's definition tells us that numerical data are
affected by a multiplicity of causes. For example, the cost of production
is affected by wage costs, exchange rates, raw material prices, etc.


According to Professor Horace Secrist, "It is the aggregate of
facts affected to a marked extent by multiplicity of causes, numerically
expressed, enumerated or estimated according to a reasonable standard of
accuracy, collected in a systematic manner for a predetermined
purpose and placed in relation to each other."


Secrist's definition of statistics is more complete. The vital points
that the definition covers are


1) Aggregate of facts 

2) Affected by multiplicity of cause 

3) Numerically expressed 

4) Estimated according to standard of accuracy 
5) Systematic Collection of data 

6) Data collected for a predetermined purpose 
7) Comparable 


2. Statistics as Statistical Methods 


According to Bowley, "Statistics is the science of measurement of
the social organism, regarded as a whole in all its manifestations."


This definition of Bowley is insufficient, as it restricts statistics
to the study of society.


According to Wallis and Roberts, "Statistics is a body of methods
for making wise decisions in the face of uncertainty."


This definition is modern, as it conveys that statistical methods
enable us to arrive at valid decisions.


According to Croxton and Cowden, "Statistics may be defined as
the science of collection, presentation, analysis and interpretation of
numerical data."


This definition gives a more elaborate meaning to statistics as
a set of statistical tools.


1.2.2 IMPORTANCE OF STATISTICS 


Statistics can be applied to various areas of business operations for
effective results. Some prominent areas are given below.


1) Startups - When opening a new business or acquiring one, we
need to study the market from a statistical point of view to get an
accurate picture of market demand and supply. A businessman must
do proper research by collecting, analysing and interpreting data
on market trends before starting his business.

2) Production - The production of a commodity depends upon
various factors such as demand, supply of capital, etc. These
factors must be analysed statistically to get a precise and accurate
view of them.

3) Marketing - An ideal marketing strategy requires statistical
analysis of the population, the income of consumers, the
availability of the product, etc.

4) Investment - Statistics plays a vital role in making decisions
regarding buying shares, debentures or real estate. Using
statistical data, an investor can aim to buy investments at a lower
price and sell when the price increases.

5) Banking - The banking sector is highly influenced by economic and
market conditions. Banks have separate research departments which
collect and analyse information regarding inflation rates, interest
rates, bank rates, etc.


1.2.3 LIMITATIONS OF STATISTICS
1) Statistics does not analyse qualitative phenomena


As statistics is a science which deals with numerical data, it cannot
be applied directly to data that cannot be measured quantitatively.
However, statistical techniques can be used to convert qualitative
data into quantitative data.


2) Statistics does not study individuals


Statistics deals with aggregate quantities and does not give
importance to individual data. This is because an individual
observation on its own is not useful for statistical analysis.


3) Statistical laws are not exact


Statistical interpretations are based on averages, and hence only
approximations can be made.


4) Statistics may be misused


Statistical data, when used by an inexperienced or untrained
person, can lead to wrong interpretations. Hence it must be used only by
experts.


1.2.4 FUNCTIONS OF STATISTICS 
1) Consolidation 


Statistics enables you to consolidate and understand large amounts of
data by providing only the significant observations.


For example, instead of examining the marks of each and every
individual, the class average will enable you to know the class's
performance as a whole.


2) Comparison 


Classification and tabulation of data are used to compare the data.
Various statistical tools such as graphs, measures of dispersion and
correlation give us huge scope for comparison.


For example, the market demand for a product can be compared among 
the states. This enables the company to identify and analyse the target 
market. 


3) Forecasting 


Forecasting means predicting the future prospects. Statistics plays a 
huge role in forecasting the future. 


For example, with the data of the sales value for the past 10 years, we 
will be able to predict the sales of the coming year approximately. Time 
series analysis and regression analysis are important for forecasting. 

4) Estimation 


One of the main aims of statistics is to draw conclusions about a
large population based on the analysis of a sample.


For example, from the heights of a sample of 10 students we will be able
to estimate the average height of all the students in the class.


5) Test of hypothesis 


A statistical hypothesis is a statement about a large population that
is tested using the observations of a sample.


For example, if a particular fertilizer helps in increasing the crop yield
in a particular area, then it may be used in other areas on the basis of
this sample.


1.2.5 SCOPE OF STATISTICS 
1) Statistics in Industries 


Statistics is extensively used in a large number of industries. It
may be used in sales forecasting, the study of consumer preferences,
quality control, inventory control, risk management, etc. Sampling is
vital for inspection plans.


2) Statistics in Education 


Statistics plays an important role in education. It helps in
measuring and evaluating the progress of students and in formulating
policies, and it also helps to predict the future performance of
students so that they can be helped to improve.


3) Statistics in Economics 


Statistics helps us to understand and analyse economic theories.
Everything from microeconomic factors, such as the demand for a
product and research on different markets, to macroeconomic concepts,
such as inflation and unemployment, can be analysed using statistics.


4) Statistics in Medicine 


Statistics helps in researching and analysing medical experiments
and investigations. Biostatistics enables researchers to identify whether
a particular treatment or drug is working and how effective it is.


5) Statistics in Modern Application


A lot of software is developed for experimentation, forecasting and
estimation.


For example, SYSTAT is one such software package, which provides
scientific and technical graphing options.


6) Statistics in Agriculture 


Statistics can be applied in agriculture to analyse the
effectiveness of fertilizers. It can also be used in taking decisions
regarding inputs and outputs, inventories, etc.


1.3 DATA 


Data are pieces of factual information that are recorded and used
for analysis. Data help us to understand certain problems by providing
us with information. They are sets of values of qualitative or
quantitative variables.


1.3.1 TYPES OF DATA 


Data are broadly classified into two types, based upon who collected
the data.


Primary data 


Primary data is the data collected by the investigator himself for
the first time for his own research and analysis. It is also known as
first-hand information. Primary data is collected using methods such as
personal interviews, surveys, etc.


Secondary data 


Secondary data is data which has already been collected and
processed by another person for the purpose of his own research.
Journals, internal sources, books, etc. are sources of secondary data.


CHECK YOUR PROGRESS -1 
1. What are the two ways in which statistics can be defined? 


2. What is the definition of statistics according to Professor 
Horace Secrist? 


3. How does statistics help in comparing data? 


4. What is the role of statistics in medicine? 


5. What is secondary data? 


1.4 DATA COLLECTING TECHNIQUES 


1.4.1 PRIMARY DATA
1) Direct Personal Investigation 


Direct personal investigation is the method in which the 
investigator directly goes to the source to collect information. 


Merits 


(i) Information collected in this method is more authentic and 
accurate 

(ii) There is high degree of accuracy in qualitative information 

(iii) The original opinion or data shall be obtained. 


Demerits 


(i) This is a time-consuming process

(ii) If the investigator is not able to understand the mental state of
the source, it may lead to wrong interpretation.

(iii) It may result in personal bias.


2) Indirect Oral Investigation 


Indirect oral investigation is when the investigator investigates a 
person close to the source. This is done due to the reluctance of the 
original person. 


Merits 


(i) It saves time and labour 
(ii) It is easy and convenient 
(iii) It covers a wide range of area. 


Demerits 


(i) Information received may not be reliable

(ii) The person chosen for this purpose may not be suitable

(iii) It may be expensive as information is collected from various
sources.


3) Information collected from local agencies 


In this method the investigator appoints a few agencies in various
regions to cover various fields of inquiry. This method is generally used
by newspaper companies to get information from various places on
various topics such as sports, economics, etc.


Merits


(i) A wide area can be easily covered
(ii) This is a time-saving method of collecting data
(iii) The cost of collecting data is less


Statistics 


NOTES 


Self-Instructional Material 


Demerits 


(i) Sometimes the pieces of information collected may contradict one
another

(ii) The information can be less accurate

(iii) This method can be expensive, as a full-time agent is hired in
different places


4) Questionnaire method 


The questionnaire method is the most popular method of collecting
primary data. A questionnaire is a set of questions devised for
conducting a survey. The questionnaire is sent to the respondent with a
request to fill it in and send it back within a specific time.


Merits 


(i) This method is cheaper
(ii) The time consumed by this process is very little
(iii) This is an unbiased method of collecting data


Demerits


(i) Sometimes the respondent may provide wrong information

(ii) There is no personal motivation in this method

(iii) There are chances of no reply or a late reply from the
respondents


General principles of framing a questionnaire 
1) The questionnaire must not be very long


We must keep the number of questions to a minimum. A long
questionnaire may lead to boredom or discontentment among the
respondents.


2) The questions must move from general to specific


When the questions move from general to specific, respondents
become more comfortable in answering them.


3) The questions should not be ambiguous


The questions must be framed in such a way that the respondents are
able to give clear and quick answers to them.


4) The questions should not contain double negatives


Phrases like "don't you" or "wouldn't you" must not be used in the
questions, as they might tempt the respondent to give a biased answer.


5) The questions should not be leading questions


The questions should not give clues to the respondent on how they
must answer.


6) The questions must not provide alternatives for the answer


For example, instead of asking "Would you like to do engineering or
medicine after class 12?", the correct way of asking the question is
"Would you like to do engineering?"


1.4.2 SECONDARY DATA 
1) Published sources 


Certain government and non-government organisations publish
various journals, research papers, surveys, etc. which are very helpful
and reliable. Some of them are mentioned below.


(i) Publications of international bodies like UNO, WTO and WHO

(ii) Publications of research institutes like ISI, NCERT and ICAR

(iii) Government publications

(iv) Publications of commercial and financial institutions

(v) Publications of non-governmental organisations

(vi) Newspaper, journals and periodicals. 


2) Unpublished sources 


Unpublished sources cover all the sources where data is maintained 
privately by certain private agencies or companies. The data collected by 
universities, research institutions also come under unpublished sources. 


1.5 PRESENTATION OF DATA 


In the previous topic we saw how data can be collected. As the data
collected is generally huge, we need to condense it and deliver it in a
presentable form. Generally there are three ways of presenting data.
They are


1) Textual or Descriptive Presentation 
2) Tabular Presentation 
3) Diagrammatic Presentation 


1.5.1 Textual or Descriptive Presentation 


When the data collected is presented in the form of text, it is
called textual or descriptive presentation. Generally this method cannot
be used to present large amounts of data.


For example, in the 2011 census, the population of India was
1,21,08,54,977, comprising 58,64,69,174 females and 62,37,24,248
males. The literacy rate was 74.04 percent and the density of population
was 382 persons per square kilometre.


From the above example, we can see that the data is represented 
textually. One of the major limitations of this method is that the readers 
must go through the entire text and get the required information. 


1.5.2 Tabular Presentation of Data 
When the data is presented in the form of rows and columns, it is
called tabular presentation of data.


Example:

AREA  | FEMALE | MALE  | TOTAL
URBAN | 90%    | 89%   | 89.5%
RURAL | 87%    | 88%   | 87.5%
TOTAL | 88.5%  | 88.5% | 88.5%


The above table represents the pass percentage of an examination
conducted in Tamil Nadu. It has three rows (urban, rural, total) and
three columns (female, male, total). It is a 3x3 table in which each
small box, called a cell, gives information about the pass percentage.
This method is very significant, as the tabulated data can be used for
further statistical treatment. Tabular presentation is further
classified into four types.


(i) Qualitative Classification 


Qualitative classification is when the collected information is
classified according to attributes such as gender, nationality, etc. The
table given above is an example of qualitative classification, where the
information is classified by gender and location.


(ii) Quantitative Classification 


When the information can be measured quantitatively, like age,
income, marks, etc., the classification is called quantitative
classification.


Example

MARKS | FREQUENCY
0-10  | 5
10-20 | 10
20-30 | 20
30-40 | 15
40-50 | 10
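
A frequency table like the one above can be built from raw marks by counting how many values fall in each class. The following Python sketch is only an illustration: the function name `frequency_table` and the raw marks are hypothetical, and it assumes that the upper limit of each class is exclusive (a mark of 10 is counted in 10-20).

```python
def frequency_table(values, start, width, n_classes):
    """Count the values falling in each class [lower, upper)."""
    counts = [0] * n_classes
    for v in values:
        index = int((v - start) // width)
        if 0 <= index < n_classes:  # values outside the classes are ignored
            counts[index] += 1
    return [
        (start + i * width, start + (i + 1) * width, counts[i])
        for i in range(n_classes)
    ]


# Hypothetical raw marks, used only to illustrate the grouping.
raw_marks = [3, 12, 18, 25, 27, 29, 34, 41, 47, 9]

for lower, upper, freq in frequency_table(raw_marks, 0, 10, 5):
    print(f"{lower}-{upper} | {freq}")
```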


(iii) Temporal Classification 


Temporal classification is when the classification is based on
time, such as years, months, days, etc.


Example

DAYS OF A WEEK | PRODUCTION (no. of pairs of shoes)
MONDAY         | 2000
TUESDAY        | 1750
WEDNESDAY      | 3000
THURSDAY       | 2250
FRIDAY         | 1550


(iv) Spatial Classification 


Spatial classification is when the data classification is based on
place, such as town, city, district, state, country, etc.


Example

STATE          | LITERACY RATE
TAMIL NADU     | 80.09%
ANDHRA PRADESH | 67.02%
KARNATAKA      | 75.36%
KERALA         | 93.91%


1.5.3 Diagrammatic Presentation 


In this method the data is represented diagrammatically, which makes
it very easy to understand. Generally, data is represented
diagrammatically in three ways.


1) Geometric Diagram 


This category consists of bar diagrams and pie charts 


(i) Bar diagram 


A bar diagram is a diagrammatic representation of data using
equally spaced rectangular bars of equal width, one for each class of
data. The height or length of a bar shows the magnitude of the class.
Bar diagrams can easily be used for comparison of data. Both
qualitative and quantitative data can be represented in a bar diagram.
They can be further divided into two broad categories.
a) Multiple bar diagram


When there is a need to compare two or more sets of data, a
multiple bar diagram is used. For example, imports and exports,
production and sales, etc.


b) Component bar diagram


Component bar diagrams, also known as sub-divided bar diagrams,
are used to compare the different components of a particular class.
For example, the various components such as rent, medicine and
education on which a monthly salary is spent can be easily
understood from a component bar diagram.


(ii) Pie diagram 


A pie diagram is similar to a component bar diagram, but the
components are represented as proportional sectors of a circle instead
of bars. The value of each class is converted into a percentage of the
total, and each percentage is then multiplied by 3.6 degrees (360/100,
i.e. the 360 degrees of a circle divided into 100 parts). The circle is
then divided into sectors with these angles.
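
The conversion from class values to sector angles can be sketched in Python as follows. This is only an illustration of the rule described above: the function name `pie_angles` and the monthly expenses are hypothetical.

```python
def pie_angles(values):
    """Convert each class value to a percentage, then to degrees (x 3.6)."""
    total = sum(values.values())
    return {
        label: (value / total) * 100 * 3.6
        for label, value in values.items()
    }


# Hypothetical monthly expenses, used only to illustrate the calculation.
expenses = {"rent": 5000, "medicine": 1000, "education": 2000, "other": 2000}

for label, angle in pie_angles(expenses).items():
    print(f"{label}: {angle:.1f} degrees")
```

Whatever the data, the angles always add up to 360 degrees, which is a quick check on the calculation.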


2) Frequency diagram 


Data in the form of a grouped frequency distribution are usually
represented by frequency diagrams. The histogram, frequency polygon,
frequency curve and ogive are types of frequency diagram.


(i) Histogram 


Histogram is a diagram which consists of rectangular bars 
whose area is proportional to the frequency of a variable and 
whose width is equal to the class interval. 


(ii) Frequency polygon 


A frequency polygon is another type of frequency
distribution graph. In a frequency polygon, the number of
observations in each class interval is marked with a single point
at the midpoint of the interval. The points are then connected
using straight lines.


(iii) Frequency curve


The frequency curve is obtained by drawing a smooth
freehand curve that passes through the points of a frequency
polygon as closely as possible.


(iv) Ogive


An ogive, also known as a cumulative frequency curve, is of
two types. When the cumulative frequencies are plotted against
the upper limits of the classes, it is a "less than" ogive. When
the cumulative frequencies are plotted against the lower limits
of the classes, it is a "more than" ogive.
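
The points to be plotted for the two ogives can be sketched in Python as follows, using the marks distribution from the quantitative classification example. The function names are our own and serve only as an illustration; drawing the curves themselves is left out.

```python
from itertools import accumulate


def less_than_ogive(classes):
    """Pair each upper limit with the number of observations below it."""
    uppers = [upper for _, upper, _ in classes]
    cumulative = list(accumulate(freq for _, _, freq in classes))
    return list(zip(uppers, cumulative))


def more_than_ogive(classes):
    """Pair each lower limit with the number of observations at or above it."""
    lowers = [lower for lower, _, _ in classes]
    freqs = [freq for _, _, freq in classes]
    total = sum(freqs)
    # Subtract the frequencies of all earlier classes from the total.
    cumulative = [total - below for below in accumulate([0] + freqs[:-1])]
    return list(zip(lowers, cumulative))


# (lower limit, upper limit, frequency) for each class of marks
marks = [(0, 10, 5), (10, 20, 10), (20, 30, 20), (30, 40, 15), (40, 50, 10)]

print(less_than_ogive(marks))  # [(10, 5), (20, 15), (30, 35), (40, 50), (50, 60)]
print(more_than_ogive(marks))  # [(0, 60), (10, 55), (20, 45), (30, 25), (40, 10)]
```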


3) Arithmetic line graph 


An arithmetic line graph, also known as a time series graph, is a
graph in which time (months, years, weeks) is plotted on the x axis and
the corresponding values are plotted on the y axis. It helps us in
analysing the trends and periodicity of data.


CHECK YOUR PROGRESS - 2
6. What is indirect oral investigation? 
7. State two merits of questionnaire method 
8. Give some examples of published sources 
9. What is component bar graph? 


10. What is spatial classification? 


1.6 SUMMARY 


• The word 'statistics' used in the plural sense refers to a
collection of numerical data, and in the singular sense it means
the science of collecting, classifying and using statistics.

• Statistics can be applied to various areas of business operations
such as start-ups, production and marketing for effective results.

• Data help us to understand certain problems by providing us with
information. Data can be further divided into primary and
secondary data.

• Direct personal investigation, indirect oral investigation and the
questionnaire method are some of the methods of collecting
primary data. Publications of international bodies and research
institutions are sources of secondary data.

• Data can be presented in three ways. They are textual or
descriptive presentation, tabular presentation and diagrammatic
presentation.


1.7 KEY WORDS 


Statistics, data, Primary data, Secondary data, Direct personal 
interview, Indirect oral investigation, Questionnaire, Qualitative, 
Quantitative, Temporal, Spatial, Bar diagram, Pie diagram, Histogram, 
Frequency Polygon, Frequency curve, Ogive , Arithmetic line graph. 
1.8 ANSWERS TO CHECK YOUR PROGRESS


1. The word 'statistics' used in the plural sense refers to a
collection of numerical data, and in the singular sense it means
the science of collecting, classifying and using statistics.

2. According to Professor Horace Secrist, "It is the aggregate of
facts affected to a marked extent by multiplicity of causes,
numerically expressed, enumerated or estimated according to a
reasonable standard of accuracy, collected in a systematic
manner for a predetermined purpose and placed in relation to
each other."

3. Classification and tabulation of data are used to compare the
data. Various statistical tools such as graphs, measures of
dispersion and correlation give us huge scope for comparison.

4. Statistics helps in researching and analysing medical experiments
and investigations. Biostatistics enables researchers to identify
whether a particular treatment or drug is working and how
effective it is.

5. Secondary data is data which has already been collected and
processed by another person for the purpose of his own research.

6. Indirect oral investigation is when the investigator investigates a
person close to the source. This is done due to the reluctance of
the original person.

7. Two merits of the questionnaire method are:
(i) This method is cheaper
(ii) The time consumed by this process is very little.

8. Publications of international bodies like UNO, WTO and WHO,
publications of research institutes like ISI, NCERT and ICAR,
and government publications.

9. Component bar diagrams, also known as sub-divided bar diagrams,
are used to compare the different components of a particular class.

10. Spatial classification is when the data classification is based on
place, such as town, city, district, state, country, etc.


1.9 QUESTIONS AND EXERCISE 


SHORT ANSWER QUESTIONS 


1. Write short notes on the types of data.
2. List the merits and demerits of direct personal interview.
3. What are the general principles followed while framing a
questionnaire?
4. Write about the classification of tabular presentation of data.
5. What is a bar diagram? What are its types?
LONG ANSWER QUESTIONS
1. Analyse the importance and scope of statistics.
2. Explain in detail the data collection techniques used for
primary data.
3. Discuss the functions and limitations of statistics.
4. Explain the various methods used for presentation of data.
2.17 Further Readings


2.0 INTRODUCTION 


Measures of central tendency are statistical tools used to
summarise data by a single value that depicts the centre of the given
data. These measures enable us to identify where most of the values
fall. The three most commonly used measures of central tendency are
the mean, median and mode. In this unit you will learn about them
extensively and also learn about some other partition values.


2.1 OBJECTIVES


From this unit you will


• Learn about the measures of central tendency

• Come to know the various methods of calculating the
mean, median and mode

• Know about the partition values


2.2 MEASURES OF CENTRAL TENDENCY 


When working with a given set of data, it is not possible to
remember all the values in the set, yet we need to draw inferences from
the data. This problem is solved by the mean, median and mode. These
measures of central tendency represent all the values of the data by a
single value. As a result, they help us to draw inferences about, and
form an estimate of, all the values. They are also known as statistical
averages. Their function is to represent mathematically all the values
in a particular set of data, so that the general trend and inclination
of the values can be seen.




An average provides a simple way of representation of all the 
individual data. It also aids in the comparison of different groups of 
data. In addition to this, an average in economic terms can represent the 
direction an economy is headed towards. Hence, it can be easily used to 
formulate policies and bring about a reform for a better economy. 


2.3 MEAN 
2.3.1 ARITHMETIC MEAN 


The arithmetic mean of a series of numbers is the sum of all the
observations divided by the total number of observations in the series.


Example: 


There are two brothers with different heights. The height of the
younger brother is 138 cm and the height of the elder brother is 154 cm.
The average height of the two brothers is the total height divided into
two equal parts,


(138 + 154) ÷ 2 = 292 ÷ 2 = 146 cm


So 146 cm is the average height of the brothers. Here 154 > 146
> 138. The average value lies between the minimum value and the
maximum value.


Thus if $x_1, x_2, \dots, x_n$ represent the values of n observations,
then the arithmetic mean (A.M.) of the n observations is (direct method)

$$\bar{x} = \frac{x_1 + x_2 + \dots + x_n}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i$$
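
The direct method can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `arithmetic_mean` is our own, and the data are the heights of the two brothers from the example above.

```python
def arithmetic_mean(values):
    """Return the sum of the observations divided by their number."""
    if not values:
        raise ValueError("at least one observation is required")
    return sum(values) / len(values)


heights_cm = [138, 154]  # heights of the two brothers, in cm
print(arithmetic_mean(heights_cm))  # 146.0
```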


There are two methods for computing the arithmetic mean: (i) 
Direct method (ii) Short cut method. 


Direct Method:
Example:


The following data represent the number of books issued in a college
library on 7 selected days: 20, 39, 22, 25, 45, 40, 54. Find the mean
number of books.


Solution:

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i = \frac{20 + 39 + 22 + 25 + 45 + 40 + 54}{7} = \frac{245}{7} = 35$$

Hence the mean number of books is 35.
Indirect Method (short cut method):


In this method an assumed mean or an arbitrary value (A) is used as
the basis for calculating the deviations ($d_i$) of the individual
values, where $d_i = x_i - A$. The mean is then

$$\bar{x} = A + \frac{\sum_{i=1}^{n} d_i}{n}$$


Example:


A student's marks in 5 subjects are 95, 78, 88, 72, 99. Find the
average of his marks.


Let us take the assumed mean, A = 88.


$x_i$ | $d_i = x_i - 88$
95    | 7
78    | -10
88    | 0
72    | -16
99    | 11
Total | -8


Solution:

$$\bar{x} = 88 + \frac{-8}{5} = 88 - 1.6 = 86.4$$


The arithmetic mean of the marks is 86.4.
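
The assumed mean calculation can be checked with a short Python sketch. The function name `mean_by_assumed_mean` is hypothetical; the sketch simply adds the average deviation to the assumed mean and compares the result with the direct method.

```python
def mean_by_assumed_mean(values, assumed_mean):
    """Arithmetic mean computed as A + (sum of deviations from A) / n."""
    if not values:
        raise ValueError("at least one observation is required")
    deviations = [x - assumed_mean for x in values]
    return assumed_mean + sum(deviations) / len(values)


marks = [95, 78, 88, 72, 99]
shortcut = mean_by_assumed_mean(marks, 88)  # assumed mean A = 88
direct = sum(marks) / len(marks)            # direct method, for comparison
print(shortcut, direct)
```

Any choice of A gives the same mean as the direct method. A value close to the data simply keeps the deviations small and the arithmetic easy.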


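Discrete Grouped data


If $x_1, x_2, \dots, x_n$ are discrete values with the corresponding
frequencies $f_1, f_2, \dots, f_n$, then the mean for discrete grouped
data is defined as (direct method)

$$\bar{x} = \frac{\sum_{i=1}^{n} f_i x_i}{N}, \quad \text{where } N = \sum_{i=1}^{n} f_i$$


In the short cut method the formula is modified as

$$\bar{x} = A + \frac{\sum_{i=1}^{n} f_i d_i}{N}, \quad \text{where } d_i = x_i - A$$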


Example: 


Given the following frequency distribution, calculate the arithmetic
mean.


Marks           | 64 | 63 | 62 | 61 | 60 | 59
No. of students | 8  | 18 | 12 | 9  | 7  | 6

Solution:

$x_i$ | $f_i$ | $f_i x_i$ | $d_i = x_i - A$ (A = 62) | $f_i d_i$
64    | 8     | 512       | 2                        | 16
63    | 18    | 1134      | 1                        | 18
62    | 12    | 744       | 0                        | 0
61    | 9     | 549       | -1                       | -9
60    | 7     | 420       | -2                       | -14
59    | 6     | 354       | -3                       | -18
Total | 60    | 3713      |                          | -7

Direct Method

$$\bar{x} = \frac{\sum_{i=1}^{n} f_i x_i}{N} = \frac{3713}{60} = 61.88$$

Short cut method

$$\bar{x} = A + \frac{\sum_{i=1}^{n} f_i d_i}{N}$$

Here A = 62

$$\bar{x} = 62 - \frac{7}{60} = 61.88$$


The mean mark is 61.88.
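
Both methods for discrete grouped data can be sketched in Python as follows, using the marks and frequencies from the example above. The function names are our own and serve only as an illustration.

```python
def grouped_mean_direct(values, frequencies):
    """Direct method: sum of f*x divided by the total frequency N."""
    total = sum(frequencies)
    return sum(f * x for x, f in zip(values, frequencies)) / total


def grouped_mean_shortcut(values, frequencies, assumed_mean):
    """Short cut method: A plus the sum of f*d divided by N, with d = x - A."""
    total = sum(frequencies)
    sum_fd = sum(f * (x - assumed_mean) for x, f in zip(values, frequencies))
    return assumed_mean + sum_fd / total


marks = [64, 63, 62, 61, 60, 59]
students = [8, 18, 12, 9, 7, 6]

print(round(grouped_mean_direct(marks, students), 2))        # 61.88
print(round(grouped_mean_shortcut(marks, students, 62), 2))  # 61.88
```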


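Mean of continuous Grouped data:


Direct method

$$\bar{x} = \frac{\sum_{i=1}^{n} f_i x_i}{N}, \quad \text{where } x_i \text{ is the midpoint of the class interval}$$

Short cut method

$$\bar{x} = A + \frac{\sum_{i=1}^{n} f_i d_i}{N} \times c, \quad \text{where } d_i = \frac{x_i - A}{c}$$


Where A is any arbitrary value
c is the width of the class interval
$x_i$ is the midpoint of the class interval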


Example:


For the frequency distribution of the yield of tomato given in the
table, calculate the mean yield per plot.


Yield per plot (in kg) | 64-84 | 84-104 | 104-124 | 124-144
No. of plots           | 3     | 5      | 7       | 20

Solution:

Yield (in kg) | No. of plots ($f_i$) | Mid $x_i$ | $f_i x_i$ | $d_i = (x_i - A)/c$ | $f_i d_i$
64-84         | 3                    | 74        | 222       | -1                  | -3
84-104        | 5                    | 94        | 470       | 0                   | 0
104-124       | 7                    | 114       | 798       | 1                   | 7
124-144       | 20                   | 134       | 2680      | 2                   | 40
Total         | 35                   |           | 4170      |                     | 44

Direct Method

$$\bar{x} = \frac{\sum_{i=1}^{n} f_i x_i}{N} = \frac{4170}{35} = 119.143$$

Short cut method (with A = 94 and c = 20)

$$\bar{x} = A + \frac{\sum_{i=1}^{n} f_i d_i}{N} \times c = 94 + \frac{44}{35} \times 20 = 119.143$$
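
The short cut method for continuous grouped data can be sketched in Python as follows. The function name `step_deviation_mean` is hypothetical; the classes and frequencies are those of the tomato yield example, with A = 94 and c = 20.

```python
def step_deviation_mean(class_limits, frequencies, assumed_mean, width):
    """Mean of continuous grouped data: A + (sum of f*d / N) * c."""
    midpoints = [(lower + upper) / 2 for lower, upper in class_limits]
    total = sum(frequencies)
    # d = (x - A) / c for each class midpoint x
    sum_fd = sum(
        f * (x - assumed_mean) / width
        for x, f in zip(midpoints, frequencies)
    )
    return assumed_mean + (sum_fd / total) * width


classes = [(64, 84), (84, 104), (104, 124), (124, 144)]
plots = [3, 5, 7, 20]

print(round(step_deviation_mean(classes, plots, 94, 20), 3))  # 119.143
```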


3.3.2 WEIGHTED ARITHMETIC MEAN 


In calculating the simple mean, all the values in the distribution
are given equal importance. In practice this may not be so. When some
items are more important than others, a simple average is not
representative of the distribution, and proper weightage has to be given
to the various items.


For example, a student may use a weighted mean to calculate the
percentage grade in a course. The student multiplies the weight of each
assessment item in the course (e.g. assignments, exams, projects) by
the grade obtained in that item.


The weighted mean is the average in which the component items are
multiplied by certain values known as "weights", and the sum of the
products is divided by the sum of the weights.


Let $x_1, x_2, \dots, x_n$ be a set of n values having weights
$w_1, w_2, \dots, w_n$ respectively; then the weighted mean is

$$\bar{x}_w = \frac{w_1 x_1 + w_2 x_2 + \dots + w_n x_n}{w_1 + w_2 + \dots + w_n} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$$


Example: 


A student obtained the marks 40,50,60,80, and 45 in math, 
statistics, physics, chemistry and biology respectively. Assuming 
weights 5,2,4,3, and 1 respectively for the above mentioned subjects, 
find the weighted arithmetic mean per subject. 


Solution 
Components | Marks scored (x_i) | Weightage (w_i) | w_i x_i
Maths | 40 | 5 | 200
Statistics | 50 | 2 | 100
Physics | 60 | 4 | 240
Chemistry | 80 | 3 | 240
Biology | 45 | 1 | 45
Total | | 15 | 825

Weighted average:

$\bar{x}_w = \frac{\sum w_i x_i}{\sum w_i} = \frac{825}{15} = 55$ marks / subject
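As a quick check of the weighted-mean formula, here is a minimal Python sketch. The function name `weighted_mean` is an illustrative choice; the marks and weights are those of the example.

```python
def weighted_mean(values, weights):
    """Return sum(w * x) / sum(w) for paired values and weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * x for x, w in zip(values, weights)) / total_weight


marks = [40, 50, 60, 80, 45]   # maths, statistics, physics, chemistry, biology
weights = [5, 2, 4, 3, 1]
print(weighted_mean(marks, weights))  # 55.0
```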
Combined Mean: 


If the arithmetic means and the numbers of items in two or more
related groups are known, the combined or composite mean of the
entire group can be obtained by

$\bar{x}_{12} = \frac{n_1 \bar{x}_1 + n_2 \bar{x}_2}{n_1 + n_2}$

where $n_1$, $n_2$ are the numbers of items and $\bar{x}_1$, $\bar{x}_2$ the means of the two groups.

The advantage of the combined arithmetic mean is that we can
determine the overall mean of the combined data without going back
to the original data.


Example: 


If a sample of 22 items has a mean of 15 and another sample of 18
items has a mean of 20, find the mean of the combined sample.

Solution:

$\bar{x}_{12} = \frac{22 \times 15 + 18 \times 20}{22 + 18} = \frac{330 + 360}{40} = \frac{690}{40} = 17.25$
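The combined-mean formula extends to any number of groups. The sketch below (the helper name `combined_mean` is hypothetical) evaluates it for the two samples in the example; 690 divided by 40 gives 17.25.

```python
def combined_mean(sizes, means):
    """Combined mean of several groups: sum(n_i * mean_i) / sum(n_i)."""
    return sum(n * m for n, m in zip(sizes, means)) / sum(sizes)


print(combined_mean([22, 18], [15, 20]))  # 17.25
```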
Merits of AM 


1. It can be calculated easily and is also easy to understand.

2. It is least affected by fluctuations of sampling.

3. It is capable of further statistical treatment.

4. It is rigidly defined and hence can be used for comparison.

Demerits of AM

1. It cannot be located graphically.

2. It is not applicable to qualitative data.

3. AM cannot be calculated if the class intervals have open ends.

4. It is highly influenced by extreme observations.
3.3.2 GEOMETRIC MEAN (GM ) 


A geometric mean is a mean or average which shows the 
central tendency of a set of numbers by using the product of their 
values. 


The geometric mean of two numbers, say x and y, is the
square root of their product x × y. For three numbers, it is the
cube root of their product, i.e. $(xyz)^{1/3}$.

The geometric mean of a series containing n observations is
the nth root of the product of the values. If $x_1, x_2, \ldots, x_n$ are
the observations, then

$G.M. = \sqrt[n]{x_1 x_2 \cdots x_n} = (x_1 x_2 \cdots x_n)^{1/n}$

$\log G.M. = \frac{1}{n} \log (x_1 x_2 \cdots x_n) = \frac{1}{n} (\log x_1 + \log x_2 + \cdots + \log x_n) = \frac{\sum_{i=1}^{n} \log x_i}{n}$

$G.M. = \text{Antilog}\left( \frac{\sum_{i=1}^{n} \log x_i}{n} \right)$


Example: 


Calculate the geometric mean of the following prices of onions
per 100 Kg per annum: 180, 250, 490, 1400 and 1050.

Solution:

x | 180 | 250 | 490 | 1400 | 1050 | Total
log x | 2.2553 | 2.3979 | 2.6902 | 3.1461 | 3.0212 | 13.5107

$G.M. = \text{Antilog}\left( \frac{\sum \log x_i}{n} \right) = \text{Antilog}\left( \frac{13.5107}{5} \right) = \text{Antilog}\ 2.7021 = 503.6$

The geometric mean of the onion price is 503.6.
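The logarithm route used above maps directly onto code. This sketch uses base-10 logarithms to mirror the log and antilog tables; the function name and the optional `freqs` argument (for a frequency distribution of mid-points) are illustrative additions.

```python
import math


def geometric_mean(values, freqs=None):
    """G.M. = antilog(sum(f * log10(x)) / sum(f)); all values must be positive."""
    if freqs is None:
        freqs = [1] * len(values)
    if any(x <= 0 for x in values):
        raise ValueError("geometric mean needs strictly positive values")
    mean_log = sum(f * math.log10(x) for x, f in zip(values, freqs)) / sum(freqs)
    return 10 ** mean_log


onion_prices = [180, 250, 490, 1400, 1050]
print(round(geometric_mean(onion_prices), 1))
```

For the onion prices this evaluates to roughly 503.6, the value obtained from the tables; small differences in later decimals come from the four-figure logarithms.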


Example: 

Find the geometric mean for the following distribution of student’s 
marks: 

Marks | 0 - 30 | 30 - 50 | 50 - 80 | 80 - 100
No. of students | 20 | 30 | 40 | 10

Solution:

Marks | No. of students (f) | Mid-point (x) | f log x
0 - 30 | 20 | 15 | 20 (log 15) = 20 (1.1761) = 23.5218
30 - 50 | 30 | 40 | 30 (log 40) = 30 (1.6021) = 48.0618
50 - 80 | 40 | 65 | 40 (log 65) = 40 (1.8129) = 72.5165
80 - 100 | 10 | 90 | 10 (log 90) = 10 (1.9542) = 19.5424
Total | 100 | | 163.6425

$G.M. = \text{Antilog}\left( \frac{\sum f \log x_i}{N} \right) = \text{Antilog}\left( \frac{163.6425}{100} \right) = \text{Antilog}\ 1.6364 = 43.29$

The geometric mean of the marks is 43.29.
Merits of Geometric mean: 
1. Itis strictly defined 
2. It is based on all items 
3. Itis very suitable for averaging ratio, rates and percentages 
4. It is capable of further mathematical treatment 
5. Unlike AM, it is not affected much by the presence of 
extreme values 
Demerits of geometric mean: 
1. It cannot be used when the values are negative or if any of the 
observations is zero 
2. It is difficult to calculate particularly when the items are very 
large or when there is a frequency distribution 
3. It brings out the property of the ratio of the change and not the 
absolute difference of change as the case in arithmetic mean 
4. The GM may not be the actual value of the series 
3.3.3 HARMONIC MEAN 
Harmonic mean of a set of observations is defined as the
reciprocal of the arithmetic average of the reciprocals of the given
values.

A harmonic mean is used in averaging of ratios. The most
common examples of ratios are those of speed and time, cost and unit
of material, work and time etc. If $x_1, x_2, \ldots, x_n$ are n observations,
the harmonic mean (H.M.) is

$H.M. = \frac{n}{\sum_{i=1}^{n} \left( \frac{1}{x_i} \right)}$

Example: 


Calculate the harmonic mean of the numbers 13.2, 14.2, 14.8, 15.2
and 16.1

Solution:

The harmonic mean is calculated as below:

x | 1/x
13.2 | 0.0758
14.2 | 0.0704
14.8 | 0.0676
15.2 | 0.0658
16.1 | 0.0621
Total | 0.3417

$H.M. = \frac{n}{\sum \left( \frac{1}{x_i} \right)} = \frac{5}{0.3417} = 14.63$


H.M. for discrete grouped data:

For a frequency distribution

$H.M. = \frac{N}{\sum_{i=1}^{n} f_i \left( \frac{1}{x_i} \right)}$, where $N = \sum f_i$

Example:

The ages of the first year students of a particular college are given
in the following frequency distribution. Calculate the harmonic mean.

Age (years) | 17 | 18 | 19 | 20 | 21
Number of students | 2 | 5 | 13 | 7 | 3

Solution:

Age (years) x | Number of students f | f/x
17 | 2 | 0.1176
18 | 5 | 0.2778
19 | 13 | 0.6842
20 | 7 | 0.3500
21 | 3 | 0.1429
Total | 30 | 1.5725

H.M. = 30 / 1.5725 = 19.0779 ≈ 19 years
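Both harmonic-mean formulas (ungrouped and frequency-weighted) can be sketched in one small Python function. The name `harmonic_mean` and its optional `freqs` parameter are illustrative; the ungrouped data are the values used in the solution table above (13.2, 14.2, 14.8, 15.2, 16.1).

```python
def harmonic_mean(values, freqs=None):
    """H.M. = N / sum(f / x); with no frequencies every f is taken as 1."""
    if freqs is None:
        freqs = [1] * len(values)
    if any(x == 0 for x in values):
        raise ValueError("harmonic mean is undefined when a value is zero")
    return sum(freqs) / sum(f / x for x, f in zip(values, freqs))


ungrouped = harmonic_mean([13.2, 14.2, 14.8, 15.2, 16.1])
ages = harmonic_mean([17, 18, 19, 20, 21], [2, 5, 13, 7, 3])
print(round(ungrouped, 2), round(ages, 2))
```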
Merits of H.M: 
1. It is strictly defined.

2. It is based on all observations.
3. It is amenable to further algebraic treatment.
4. It is the most suitable average when it is desired to give greater
weight to smaller observations and less weight to larger
observations.
Demerits of H.M:

1. It is not easily understood.

2. It is difficult to calculate.

3. It is only an abstract figure and may not be an actual item of the
series.


CHECK YOUR PROGRESS -1 
1. What are the three measures of central tendency?
2. What is the formula for arithmetic mean under the
direct method?
3. Mention two merits of the geometric mean.
4. What do you mean by harmonic mean?


2.4 MEDIAN 


The number of students in your classroom, the money your parents
earn and the temperature in your city are all important numbers. But how
can you get information about the number of students in your school
or the amount earned by the citizens of your entire city?

The median is that value of the variable which divides the
group into two equal parts, one part comprising all values greater than
the median and the other all values less than the median.


Ungrouped data 
Arrange the given values in ascending or descending order.
If the number of values is odd, the median is the middle value.

For example, take the values 12, 15, 21, 27, 35. The number of values
is odd, so the median is the middle value, 21.

Median = $\left( \frac{n+1}{2} \right)^{th}$ term, if n is odd

If the number of values is even, the median is the mean of the middle two
values.

For example, take the values 12, 15, 21, 27, 35, 40. The number of values
is even, so the median is the mean of the two middle values:

Median = Mean of the $\left( \frac{n}{2} \right)^{th}$ and $\left( \frac{n}{2} + 1 \right)^{th}$ terms, if n is even

So in the above example, add 21 and 27 and divide the sum by 2,
which gives 24.


Example: 


The salaries of 8 employees who work for a small company 
are listed below. What is the median salary? 


40,000; 29,000; 35,500; 31,000; 43,000; 30,000; 27,000; 32,000 
Solution: 

Arrange the data in ascending order 

27,000; 29,000; 30,000; 31,000; 32,000; 35,500; 40,000; 43,000 


Since there is an even number of items in the data set, we compute
the median by taking the mean of the two middlemost numbers.

Median = Mean of the $\left( \frac{n}{2} \right)^{th}$ and $\left( \frac{n}{2} + 1 \right)^{th}$ terms = Mean of the 4th and 5th terms

= (31,000 + 32,000) / 2 = 63,000 / 2 = 31,500

The median salary is 31,500


Example: 13 


Find the median of the following set of points in a game: 15, 14, 10, 
8, 12, 8, 16 
Solution:
First arrange the values in ascending order: 8, 8, 10, 12, 14, 15, 16

The number of point values is 7, an odd number. Hence, the median
is the value in the middle position.

Median = $\left( \frac{n+1}{2} \right)^{th}$ term = $\left( \frac{7+1}{2} \right)^{th}$ term = 4th term

The median is 12
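The odd and even rules for ungrouped data can be written as one short function. This is a sketch with an illustrative name; Python's standard `statistics.median` applies the same rule.

```python
def median_ungrouped(values):
    """Middle value for odd n; mean of the two middle values for even n."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


salaries = [40000, 29000, 35500, 31000, 43000, 30000, 27000, 32000]
points = [15, 14, 10, 8, 12, 8, 16]
print(median_ungrouped(salaries))  # 31500.0
print(median_ungrouped(points))    # 12
```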
Grouped data: 


In a grouped distribution, values are associated with
frequencies. Grouping can be in the form of a discrete frequency
distribution or a continuous frequency distribution. Whatever the
distribution, cumulative frequencies have to be calculated to locate
the middle item among the total number of items.

Cumulative frequency: (cf)

The cumulative frequency of each class is the sum of the
frequency of the class and the frequencies of the previous classes, i.e.
the frequencies are added successively, so that the last cumulative
frequency gives the total number of items.

When the data follow a discrete set of values grouped by size, we use the
$\left( \frac{N+1}{2} \right)^{th}$ item for finding the median. First we form a cumulative
frequency distribution; the median is the value corresponding to the
cumulative frequency in which the $\left( \frac{N+1}{2} \right)^{th}$ item lies.


Example: 14 


The following frequency distribution classifies branches according to
the number of students in each. Calculate the median number of
students per branch.

No. of Students | 1 | 2 | 3 | 4 | 5 | 6 | 7
Number of Branches | 2 | 11 | 15 | 20 | 25 | 18 | 10

Solution:

No. of Students (x) | No. of Branches (f) | Cumulative Frequency (cf)
1 | 2 | 2
2 | 11 | 13
3 | 15 | 28
4 | 20 | 48
5 | 25 | 73
6 | 18 | 91
7 | 10 | 101
Total | 101 |

Median = size of $\left( \frac{N+1}{2} \right)^{th}$ item = size of $\left( \frac{101+1}{2} \right)^{th}$ item = 51st item

Median = 5, because the 51st item falls in the cumulative frequency 73, which corresponds to x = 5
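The cumulative-frequency lookup can be sketched as follows. The function name is illustrative, and the values are assumed to be listed in ascending order, as in the table.

```python
from itertools import accumulate


def median_discrete(values, freqs):
    """Return the value whose cumulative frequency first reaches the (N+1)/2-th item."""
    n = sum(freqs)
    position = (n + 1) / 2
    for x, cf in zip(values, accumulate(freqs)):
        if cf >= position:
            return x
    raise ValueError("empty frequency distribution")


students = [1, 2, 3, 4, 5, 6, 7]
branches = [2, 11, 15, 20, 25, 18, 10]
print(median_discrete(students, branches))  # 5
```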
Median for continuous grouped data 


When the data are given in the form of a frequency table with
class intervals, the following formula is used for calculating the
median of continuous grouped data:

$\text{Median} = l + \frac{\frac{N}{2} - m}{f} \times c$

Where l = lower limit of the median class
m = cumulative frequency preceding the median class
c = width of the median class
f = frequency of the median class
N = total frequency
Example: 


Calculate median from the following data 


Class interval | 0-4 | 5-9 | 10-14 | 15-19 | 20-24 | 25-29 | 30-34 | 35-39
Frequency | 5 | 8 | 10 | 12 | 7 | 6 | 3 | 2

Solution:

Class interval | Frequency f | True class interval | Cumulative frequency cf
0-4 | 5 | -0.5 - 4.5 | 5
5-9 | 8 | 4.5 - 9.5 | 13
10-14 | 10 | 9.5 - 14.5 | 23
15-19 | 12 | 14.5 - 19.5 | 35
20-24 | 7 | 19.5 - 24.5 | 42
25-29 | 6 | 24.5 - 29.5 | 48
30-34 | 3 | 29.5 - 34.5 | 51
35-39 | 2 | 34.5 - 39.5 | 53
Total | 53 | |

N/2 = 53/2 = 26.5

The first cumulative frequency greater than or equal to 26.5 is 35, so the
median class is 14.5 - 19.5.

l = 14.5, N/2 = 26.5, m = 23, f = 12, c = 5

$\text{Median} = l + \frac{\frac{N}{2} - m}{f} \times c = 14.5 + \frac{26.5 - 23}{12} \times 5 = 14.5 + 1.46 = 15.96$
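The interpolation formula can be sketched in Python as below. The function name and argument layout are assumptions made for illustration; the lower bounds are the true class limits, and all classes are assumed to share one width.

```python
def median_grouped(lower_bounds, freqs, width):
    """Median = l + ((N/2 - m) / f) * c for a continuous frequency distribution."""
    half = sum(freqs) / 2
    cumulative = 0  # m: cumulative frequency before the current class
    for lower, f in zip(lower_bounds, freqs):
        if f > 0 and cumulative + f >= half:
            return lower + (half - cumulative) / f * width
        cumulative += f
    raise ValueError("empty frequency distribution")


lower_bounds = [-0.5, 4.5, 9.5, 14.5, 19.5, 24.5, 29.5, 34.5]
freqs = [5, 8, 10, 12, 7, 6, 3, 2]
print(round(median_grouped(lower_bounds, freqs, width=5), 2))  # 15.96
```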

Merits of Median: 


1. Median is not influenced by extreme values because it is a 
positional average. 

2. Median can be calculated in case of distribution with open 
end intervals. 

3. Median can be located even if the data are incomplete. 

4. Median can be located even for qualitative factors such as 
ability, honesty etc. 


Demerits of Median: 


1. A slight change in the series may bring a drastic change in the
median value.

2. In the case of an even number of items or a continuous series, the
median is an estimated value that may not be any value in the series.

3. It is not suitable for further mathematical treatment except its
use in mean deviation.
4. It does not take all the observations into account.


2.5 MODE 


The mode is the most frequently occurring value or score. The
mode is useful when there are a lot of repeated values. A data set can
have no mode, one mode, or multiple modes.

The mode is very important in marketing studies, where a manager is
interested in knowing the size that has the highest concentration of
items. For example, in placing an order for shoes or ready-made
garments, the modal size helps, because this size and the sizes around
it are in common demand.


Ungrouped Data: 


For ungrouped values or a series of individual observation mode 
is often found by mere inspection 


Example: 


Find the mode for the following list of values: 
13,18,13,14,13,16,14,21,13 


Solution: 
The mode is the number that is repeated more often than any other 
Therefore the Mode = 13 


In some cases the mode may be absent while in some cases there 
may be more than one mode. 


Example: 
Ms. Rossy asked the students in her class how many siblings each of them has.
Find the mode of the data: 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 4
Solution:
The values 1 and 2 each occur four times, more often than any other value, so the modes are 1 and 2 siblings
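Finding the mode by inspection amounts to counting repetitions. The sketch below uses `collections.Counter`; the function name is illustrative, and treating a data set in which no value repeats as having no mode is an assumed convention.

```python
from collections import Counter


def modes(values):
    """Return all values sharing the highest count; empty list if nothing repeats."""
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        return []  # assumed convention: no repeated value means no mode
    return sorted(v for v, c in counts.items() if c == highest)


print(modes([13, 18, 13, 14, 13, 16, 14, 21, 13]))            # [13]
print(modes([0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 4]))      # [1, 2]
```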
Grouped Data 


For a discrete distribution, the value of X corresponding to the highest
frequency is the mode.

Continuous distribution:

$\text{Mode} = L + \left( \frac{f_1 - f_0}{2f_1 - f_0 - f_2} \right) \times h$

Where L is the lower class limit of the modal class

$f_1$ is the frequency of the modal class

$f_0$ is the frequency of the class preceding the modal class
in the frequency table

$f_2$ is the frequency of the class succeeding the modal class
in the frequency table

h is the class interval of the modal class


Example:18 

Calculate mode for the following: 

Class | 0-50 | 50-100 | 100-150 | 150-200 | 200-250 | 250-300 | 300-350 | 350-400 | 400 and above
f | 5 | 14 | 40 | 91 | 150 | 87 | 60 | 38 | 15

Solution:

The highest frequency is 150 and the corresponding class interval is
200 - 250, which is the modal class.

Here L = 200, $f_1$ = 150, $f_0$ = 91, $f_2$ = 87, h = 50

$\text{Mode} = L + \left( \frac{f_1 - f_0}{2f_1 - f_0 - f_2} \right) \times h = 200 + \frac{150 - 91}{2 \times 150 - 91 - 87} \times 50$

$= 200 + \frac{2950}{122} = 200 + 24.18 = 224.18$


Example: 19 


Find the modal class and the actual mode of the data set below 


Number | 1-3 | 4-6 | 7-9 | 10-12 | 13-15 | 16-18 | 19-21 | 22-24 | 25-27 | 28-30
Frequency | 7 | 6 | 4 | 9 | 2 | 8 | 1 | 2 | 3 | 2

Solution:
The highest frequency is 9, so the modal class = 10 - 12

Here L = 10, $f_1$ = 9, $f_0$ = 4, $f_2$ = 2, h = 3

$\text{Mode} = L + \left( \frac{f_1 - f_0}{2f_1 - f_0 - f_2} \right) \times h = 10 + \frac{9 - 4}{2 \times 9 - 4 - 2} \times 3 = 10 + \frac{5}{12} \times 3 = 10 + 1.25 = 11.25$

Mode = 11.25
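The grouped-mode formula is a single expression, so a sketch of it is short. The function name and parameter names are illustrative; the two calls use the values from the two worked examples.

```python
def mode_grouped(lower, f1, f0, f2, width):
    """Mode = L + (f1 - f0) / (2*f1 - f0 - f2) * h for the modal class."""
    denominator = 2 * f1 - f0 - f2
    if denominator == 0:
        raise ValueError("formula is undefined when 2*f1 - f0 - f2 is zero")
    return lower + (f1 - f0) / denominator * width


print(round(mode_grouped(200, 150, 91, 87, 50), 2))  # 224.18
print(round(mode_grouped(10, 9, 4, 2, 3), 2))        # 11.25
```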
Merits of mode: 


1. It is easy to calculate and in some cases it can be located mere 
inspection. 

2. Mode is not at all affected by extreme values 

3. It can be calculated for open-end classes 

4. It is usually an actual value of an important part of the series 

5. In some circumstances it is the best representative of data 


Demerits of mode: 


1. It is not based on all observation 

2. It is not capable of further mathematical treatment 

3. Mode is ill-defined; in some cases it is not possible to find a
definite mode.

4. As compared with the mean, the mode is affected to a great extent by
sampling fluctuations.

5. It is unsuitable in cases where the relative importance of items has
to be considered.


2.6 PARTITION MEASURES 
2.6.1 QUARTILES 


The quartiles divide the distribution into four equal parts. There are
three quartiles, denoted by Q1, Q2 and Q3.

That is, 25% of the data lie below Q1, 50% of the data below Q2 and
75% below Q3. Here Q2 is the median. Quartiles are
obtained in almost the same way as the median.

Ungrouped Data:

If the data set consists of n items arranged in ascending order, then

$Q_1 = \left( \frac{n+1}{4} \right)^{th}$ item, $Q_2 = \left( \frac{n+1}{2} \right)^{th}$ item and $Q_3 = 3 \left( \frac{n+1}{4} \right)^{th}$ item
Example:20 


Compute quartiles for the data 25, 18, 30, 8, 15, 5, 10, 35, 40, 45.
Solution:

Arrange the data in ascending order: 5, 8, 10, 15, 18, 25, 30, 35, 40, 45 (n = 10)

$Q_1 = \left( \frac{n+1}{4} \right)^{th}$ item = $\left( \frac{10+1}{4} \right)^{th}$ item = (2.75)th item

= 2nd item + (3/4) (3rd item - 2nd item)
= 8 + (3/4) (10 - 8) = 8 + 1.5
Q1 = 9.5

$Q_3 = 3 \left( \frac{n+1}{4} \right)^{th}$ item = 3 × (2.75)th item = (8.25)th item

= 8th item + (1/4) (9th item - 8th item)
= 35 + (1/4) (40 - 35) = 35 + 1.25
Q3 = 36.25
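The (n + 1) position rule with linear interpolation can be sketched as a small helper. The name `partition_value` and its `parts` argument are illustrative; with `parts=4` it gives quartiles, and the same rule with 10 or 100 parts gives deciles and percentiles.

```python
def partition_value(values, k, parts):
    """Value at the k*(n+1)/parts-th position, interpolating between neighbours."""
    ordered = sorted(values)
    n = len(ordered)
    position = k * (n + 1) / parts      # 1-based, may be fractional
    whole = int(position)
    fraction = position - whole
    if whole < 1:
        return ordered[0]
    if whole >= n:
        return ordered[-1]
    return ordered[whole - 1] + fraction * (ordered[whole] - ordered[whole - 1])


data = [25, 18, 30, 8, 15, 5, 10, 35, 40, 45]
print(partition_value(data, 1, 4))  # 9.5
print(partition_value(data, 3, 4))  # 36.25
```

Statistical packages often use other interpolation conventions, so their quartiles may differ slightly from this textbook rule.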


Continuous series: 


In the case of continuous series, find the cumulative frequency and 
then use the interpolation formula. 


• Find the cumulative frequencies

• Find N/4 (and 3N/4 for Q3)

• The Q1 class is the class interval corresponding to the
cumulative frequency just greater than N/4

• The Q3 class is the class interval corresponding to the
cumulative frequency just greater than 3N/4

$Q_1 = l_1 + \frac{\frac{N}{4} - m_1}{f_1} \times c_1$ and $Q_3 = l_3 + \frac{\frac{3N}{4} - m_3}{f_3} \times c_3$

Where N = Σf = total of all frequency values
$l_1$ = lower limit of the first quartile class
$f_1$ = frequency of the first quartile class
$c_1$ = width of the first quartile class
$m_1$ = cumulative frequency preceding the first quartile class
$l_3$ = lower limit of the 3rd quartile class
$f_3$ = frequency of the 3rd quartile class
$m_3$ = cumulative frequency preceding the 3rd quartile class
$c_3$ = width of the third quartile class
Example: 


The marks secured by a group of students in their internals are given below. Find Q1 and Q3.

Class | 10 - 20 | 20 - 30 | 30 - 40 | 40 - 50 | 50 - 60
Frequency | 4 | 3 | 2 | 1 | 5

Solution:

Class | Frequency f | Cumulative frequency cf
10 - 20 | 4 | 4
20 - 30 | 3 | 7
30 - 40 | 2 | 9
40 - 50 | 1 | 10
50 - 60 | 5 | 15

N/4 = 15/4 = 3.75, which lies in the group 10 - 20

$Q_1 = l_1 + \frac{\frac{N}{4} - m_1}{f_1} \times c_1 = 10 + \frac{3.75 - 0}{4} \times 10 = 10 + 9.38 = 19.38$

3N/4 = 3 × 15/4 = 11.25, which lies in the group 50 - 60

Therefore Q3 lies in the group 50 - 60

$Q_3 = l_3 + \frac{\frac{3N}{4} - m_3}{f_3} \times c_3 = 50 + \frac{11.25 - 10}{5} \times 10 = 50 + 2.5 = 52.5$
2.6.2 DECILES 


These are the values which divide the total number of observations
into 10 equal parts. There are nine deciles: D1, D2, D3, D4, D5, D6, D7, D8 and D9.

Ungrouped Data:
Example:
Compute D5 for the data: 5, 24, 36, 12, 20, and 8.
Solution:
Arranging the given data in ascending order: 5, 8, 12, 20, 24, 36 (n = 6)

$D_5 = \left( \frac{5(n+1)}{10} \right)^{th}$ observation = $\left( \frac{5 \times 7}{10} \right)^{th}$ observation = (3.5)th observation

= 3rd item + (1/2) (4th item - 3rd item)

= 12 + (1/2) (20 - 12) = 12 + 4 = 16
Grouped Data:
Example:
Calculate D4 and D7 for the given data

Class interval | 0-10 | 10-20 | 20-30 | 30-40 | 40-50 | 50-60 | 60-70
Frequency | 5 | 7 | 12 | 16 | 10 | 8 | 4

Solution:

Class interval | Frequency f | Cumulative frequency cf
0-10 | 5 | 5
10-20 | 7 | 12
20-30 | 12 | 24
30-40 | 16 | 40
40-50 | 10 | 50
50-60 | 8 | 58
60-70 | 4 | 62

D4 = (4N/10)th item = (4 × 62/10)th item = (24.8)th item

This lies in the interval 30 - 40

$D_4 = l + \frac{\frac{4N}{10} - m}{f} \times c = 30 + \frac{24.8 - 24}{16} \times 10 = 30 + \frac{0.8}{16} \times 10 = 30 + 0.5 = 30.5$

D7 = (7N/10)th item = (7 × 62/10)th item = (43.4)th item

This lies in the interval 40 - 50

$D_7 = l + \frac{\frac{7N}{10} - m}{f} \times c = 40 + \frac{43.4 - 40}{10} \times 10 = 40 + 3.4 = 43.4$
2.6.3 PERCENTILE 


The percentile values divide the distribution into 100 parts, each
containing 1 percent of the cases. The k-th percentile ($P_k$) is that value
of the variable up to which exactly k% of the total number of observations lie.

Relationship

P25 = Q1
P50 = Median = Q2
P75 = 3rd quartile = Q3


Ungrouped Data: 
Example: 24 


The monthly incomes (in thousands) of 8 persons working in a factory
are 17, 21, 14, 36, 10, 25, 15, 29. Find P30.

Solution:

Arrange the data in increasing order: 10, 14, 15, 17, 21, 25, 29, 36

n = 8

$P_{30} = \left( \frac{30(n+1)}{100} \right)^{th}$ item = $\left( \frac{30(8+1)}{100} \right)^{th}$ item = $\left( \frac{30 \times 9}{100} \right)^{th}$ item = 2.7th item

= 2nd item + 0.7 (3rd item - 2nd item)

= 14 + 0.7 (15 - 14)

= 14 + 0.7
P30 = 14.7
Grouped Data: 


Example: 25 


Find P53 for the following frequency distribution.

Class interval | 0-5 | 5-10 | 10-15 | 15-20 | 20-25 | 25-30 | 30-35 | 35-40
Frequency | 5 | 8 | 12 | 16 | 20 | 10 | 4 | 3

Solution:

Class interval | Frequency | Cumulative frequency
0-5 | 5 | 5
5-10 | 8 | 13
10-15 | 12 | 25
15-20 | 16 | 41
20-25 | 20 | 61
25-30 | 10 | 71
30-35 | 4 | 75
35-40 | 3 | 78
Total | 78 |

53N/100 = 53 × 78/100 = 41.34, which lies in the class 20 - 25 (l = 20, m = 41, f = 20, c = 5)

$P_{53} = l + \frac{\frac{53N}{100} - m}{f} \times c = 20 + \frac{41.34 - 41}{20} \times 5 = 20 + 0.085 = 20.085$
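Grouped quartiles, deciles and percentiles all use the same interpolation, with N/4, N/10 or N/100 as the unit. The sketch below (function and parameter names are illustrative, and equal class widths are assumed) evaluates the k-th of `parts` partition values; for P53 it computes 20 + (41.34 - 41)/20 × 5.

```python
def grouped_partition_value(lower_bounds, freqs, width, k, parts):
    """l + ((k*N/parts - m) / f) * c for the class containing the k*N/parts-th item."""
    target = k * sum(freqs) / parts
    cumulative = 0  # m: cumulative frequency before the current class
    for lower, f in zip(lower_bounds, freqs):
        if f > 0 and cumulative + f >= target:
            return lower + (target - cumulative) / f * width
        cumulative += f
    raise ValueError("empty frequency distribution")


p53 = grouped_partition_value(
    [0, 5, 10, 15, 20, 25, 30, 35], [5, 8, 12, 16, 20, 10, 4, 3], 5, k=53, parts=100
)
print(round(p53, 3))
```

Calling it with `parts=4` or `parts=10` reproduces the grouped quartile and decile calculations of the earlier examples.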


2.7 MEASURES OF DISPERSION 

Dispersion is the extent to which a distribution is stretched
or squeezed, i.e. how widely the items are scattered around the average.
We can understand variation with the help of the following example:

Series I | Series II | Series III
10 | 2 | 10
10 | 8 | 12
10 | 20 | 8
ΣX = 30 | 30 | 30


In all three series, the value of arithmetic mean is 10. On the basis 
of this average, we can say that the series are alike. If we carefully 
examine the composition of three series, we find the following 
differences: 


(i) In case of Ist series, three items are equal; but in 2nd and 3rd 
series, the items are unequal and do not follow any specific 
order. 

(ii) The magnitude of deviation, item-wise, is different for the Ist, 
2nd and 3rd series. But all these deviations cannot be 
ascertained if the value of simple mean is taken into 
consideration. 
(iii) In these three series, the value of the arithmetic mean is 10
in each case, but the value of the median may differ from series
to series. This can be understood as follows (items arranged in
ascending order):

Series I | Series II | Series III
10 | 2 | 8
10 (median) | 8 (median) | 10 (median)
10 | 20 | 12
ΣX = 30 | 30 | 30

The value of the median in the 1st series is 10, in the 2nd series 8
and in the 3rd series 10. Therefore, the values of the mean and the
median are not always identical.


(iv) As the average remains the same, the nature and extent of the 
distribution of the size of the items may vary. In other words, 
the structure of the frequency distributions may differ even 
though their means are identical. 


2.7.1 PROPERTIES OF A GOOD MEASURE OF DISPERSION 
There are certain pre-requisites for a good measure of dispersion: 


1. It should be simple to understand. 

2. It should be easy to compute. 

3. It should be rigidly defined. 

4. It should be based on each individual item of the distribution. 
5. It should be capable of further algebraic treatment. 
2.7.2 CHARACTERISTICS OF MEASURES OF DISPERSION


A measure of dispersion should be rigidly defined 

It must be easy to calculate and understand 

Not affected much by the fluctuations of observations 
Based on all observations 


2.7.3 CLASSIFICATION OF MEASURES OF DISPERSION 


The measure of dispersion is categorized as: 


(i) An absolute measure of dispersion: 


It involves the units of measurements of the observations. For 
example, (i) the dispersion of salary of employees is expressed in rupees, 
and (ii) the variation of time required for workers is expressed in hours. 
Such measures are not suitable for comparing the variability of the two 
data sets which are expressed in different units of measurements 
(ii) A relative measure of dispersion:


It is a pure number independent of the units of measurements. 
This measure is useful especially when the data sets are measured in 
different units of measurement 


For example, a nutritionist would like to compare the obesity of school 
children in India and Africa. He collects data from some of the schools in 
these two countries. The weight is normally measured in kilograms in 
India and in pounds in Africa. It will be meaningless, if we compare the 
obesity of students using absolute measures. So it is sensible to compare 
them in relative measures. 


2.8 RANGE 


Raw Data: The range is the most common and easily understandable
measure of dispersion. It is the difference between the largest (L) and
smallest (S) observations in the data set.

Range (R) = L - S

Grouped Data: For a grouped frequency distribution of values in the data
set, the range is the difference between the upper class limit of the last
class interval and the lower class limit of the first class interval.


Coefficient of Range: The relative measure of range is called the 
coefficient of range 


Coefficient of range = (L-S) / (L +S) 
Example: 


Find the value of range and its coefficient for the following data 49, 81, 
36, 64, 121, 100. 


Solution: 
L = 121, S = 36
Range = L - S = 121 - 36 = 85
Coefficient of Range = (L - S) / (L + S) = (121 - 36) / (121 + 36)
= 85 / 157 = 0.5414


Example: 


Calculate range and its coefficient from the following distribution. 


x | 10 - 15 | 15 - 20 | 20 - 25 | 25 - 30
Frequency | 4 | 10 | 16 | 8

Solution: L = 30, S = 10
Range = L - S = 30 - 10 = 20
Coefficient of Range = (L - S) / (L + S) = (30 - 10) / (30 + 10)
= 20 / 40 = 0.5
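Range and its coefficient need only the largest and smallest values. A minimal sketch follows, with an illustrative function name; for a grouped distribution, pass the lower limit of the first class and the upper limit of the last class.

```python
def range_and_coefficient(values):
    """Return (L - S, (L - S) / (L + S)) for the largest L and smallest S."""
    largest, smallest = max(values), min(values)
    spread = largest - smallest
    return spread, spread / (largest + smallest)


r, coeff = range_and_coefficient([49, 81, 36, 64, 121, 100])
print(r, round(coeff, 4))                 # 85 0.5414
print(range_and_coefficient([10, 30]))    # (20, 0.5)
```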
Merits of Range 


• It is the simplest of the measures of dispersion
• Easy to calculate
• Easy to understand
• Independent of change of origin

Demerits of Range

• It is based on the two extreme observations. Hence, it is affected by
fluctuations
• The range is not a reliable measure of dispersion
• Dependent on change of scale


2.9 QUARTILE DEVIATION 


The quartiles divide a data set into quarters. The first quartile
(Q1) is the middle number between the smallest number and the median
of the data. The second quartile (Q2) is the median of the data set. The
third quartile (Q3) is the middle number between the median and the
largest number. Quartile deviation is half of the difference between the
first and third quartiles. Hence it is also called the semi-interquartile range.

Quartile deviation or semi-interquartile range:
Q.D. = (Q3 - Q1) / 2
Coefficient of Quartile Deviation
Coefficient of Q.D. = (Q3 - Q1) / (Q3 + Q1)
Merits of Quartile Deviation

• It overcomes the main drawback of the range, since it does not depend on the two extreme observations
• It uses the middle half of the data
• Independent of change of origin
• The best measure of dispersion for open-end classification

Demerits of Quartile Deviation

• It ignores fifty percent of the data
• Dependent on change of scale
• Not a reliable measure of dispersion


Example: 
Calculate the quartile deviation and its coefficient for the wheat
production (in Kg) of 20 acres, given as: 1120, 1240, 1320, 1040,
1080, 1200, 1440, 1360, 1680, 1730, 1785, 1342, 1960, 1880, 1755,
1720, 1600, 1470, 1750 and 1885.


Solution: Arrange the observation in increasing order: 


1040, 1080, 1120, 1200, 1240, 1320, 1342, 1360, 1440, 1470, 
1600, 1680, 1720, 1730, 1750, 1755, 1785, 1880, 1885, 1960. 


Q1 = value of (n+1) / 4 th item 
= value of (20 +1) /4 th item = value of (5.25)th item 
= 5th item + 0.25 ( 6th item — 5th item) 

= 1240 + 0.25 (1320 — 1240) 
= 1240+ 20 = 1260 

Q1 = 1260 

Q3 = value of 3(n+1) / 4 th item 
= value of 3(20 +1) /4 th item = value of (15.75)th item 
= 15th item + 0.75 ( 16th item —15th item) 

= 1750 + 0.75 (1755 — 1750) 
= 1750 + 3.75 = 1753.75 


Q3 = 1753.75 
Q.D = (Q3 - Q1) / 2 = (1753.75 - 1260) / 2 = 493.75 / 2
= 246.875


Coefficient of QD = (Q3 - Q1) / (Q3+Q1) 
= (1753.75 — 1260) / (1753.75 + 1260) 
= 0.164 
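The whole quartile-deviation calculation can be sketched in a few lines. The helper names are illustrative; the quartile positions follow the k(n + 1)/4 rule with linear interpolation used in the solution.

```python
def quartile(ordered, k):
    """k-th quartile of an ascending list via the k*(n+1)/4 position rule."""
    n = len(ordered)
    position = k * (n + 1) / 4
    whole = int(position)
    fraction = position - whole
    if whole < 1:
        return ordered[0]
    if whole >= n:
        return ordered[-1]
    return ordered[whole - 1] + fraction * (ordered[whole] - ordered[whole - 1])


def quartile_deviation(values):
    """Return (Q.D., coefficient of Q.D.)."""
    ordered = sorted(values)
    q1, q3 = quartile(ordered, 1), quartile(ordered, 3)
    return (q3 - q1) / 2, (q3 - q1) / (q3 + q1)


wheat = [1120, 1240, 1320, 1040, 1080, 1200, 1440, 1360, 1680, 1730,
         1785, 1342, 1960, 1880, 1755, 1720, 1600, 1470, 1750, 1885]
qd, qd_coeff = quartile_deviation(wheat)
print(qd, round(qd_coeff, 3))  # 246.875 0.164
```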


2.10 MEAN DEVIATION 


The mean deviation (or average deviation) is defined as the sum of the
absolute deviations from an average divided by the number of items in
the distribution. The average can be the mean, median or mode.
Theoretically, the median is the best average to choose, because the sum
of the deviations from the median is a minimum when signs are ignored.
In practice, however, the arithmetic mean is the most commonly used
average for calculating the mean deviation, which is denoted by the
symbol MD.

Mean deviation is calculated for three types of series:

• Individual Data Series
• Discrete Data Series
• Continuous Data Series

Individual Data Series: For an individual series, the mean deviation can
be calculated using the following formula.

$MD = \frac{1}{N} \sum |X - A| = \frac{\sum |D|}{N}$

Where
MD = Mean deviation
X = Variable values
A = Average of choice
N = Number of observations
|D| = |X - A|, the deviation with its sign ignored

Coefficient of Mean Deviation:

The mean deviation calculated from any measure of central tendency is
an absolute measure. For the purpose of comparing variation among
different series, a relative mean deviation is required. The relative
mean deviation is obtained by dividing the mean deviation by the average
used for calculating it.

The Coefficient of Mean Deviation can be calculated using

Coefficient of MD = MD / A
Example: 
Calculate mean deviation and coefficient of mean deviation for the 
following individual data: 


Items | 28 | 72 | 90 | 140 | 210

Solution:

$A = \bar{X} = \frac{28 + 72 + 90 + 140 + 210}{5} = \frac{540}{5} = 108$

Item X | Deviation |D| = |X - 108|
28 | 80
72 | 36
90 | 18
140 | 32
210 | 102
Total | Σ|D| = 268

Mean Deviation = MD = $\frac{1}{N} \sum |X - A| = \frac{\sum |D|}{N} = \frac{268}{5} = 53.6$

Coefficient of Mean Deviation = $\frac{MD}{A} = \frac{53.6}{108} = 0.4963$
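A short Python sketch of the individual-series calculation follows. The function name and the optional `center` and `freqs` arguments are illustrative; when no centre is given, the arithmetic mean is used, as in the example.

```python
def mean_deviation(values, center=None, freqs=None):
    """Return (MD, MD / center) with MD = sum(f * |x - center|) / sum(f)."""
    if freqs is None:
        freqs = [1] * len(values)
    n = sum(freqs)
    if center is None:
        center = sum(f * x for x, f in zip(values, freqs)) / n
    md = sum(f * abs(x - center) for x, f in zip(values, freqs)) / n
    return md, md / center


md, md_coeff = mean_deviation([28, 72, 90, 140, 210])
print(round(md, 1), round(md_coeff, 4))  # 53.6 0.4963
```

Passing `center` explicitly lets the same function take deviations from the median, and `freqs` extends it to discrete and continuous series (using class mid-points for the latter).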


Discrete Data Series 


For a discrete series, the mean deviation can be calculated using

$MD = \frac{\sum f |x - Me|}{N} = \frac{\sum f |D|}{N}$

Where, N = Number of observations (Σf)
f = frequency of each item
x = values of the items
Me = Median

Coefficient of Mean Deviation

The Coefficient of Mean Deviation can be calculated using the
following formula.

Coefficient of MD = MD / Me

Example: Calculate the mean deviation and its coefficient for the following discrete data

Items | 42 | 108 | 135 | 150 | 210
Frequency | 6 | 15 | 3 | 3 | 9

Solution:

$x_i$ | Frequency $f_i$ | Cumulative frequency cf | $|x_i - Me|$ | $f_i |x_i - Me|$
42 | 6 | 6 | 66 | 396
108 | 15 | 21 | 0 | 0
135 | 3 | 24 | 27 | 81
150 | 3 | 27 | 42 | 126
210 | 9 | 36 | 102 | 918
Total | N = 36 | | | $\sum f_i |x_i - Me|$ = 1521

Median = size of $\left( \frac{N+1}{2} \right)^{th}$ item = $\left( \frac{36+1}{2} \right)^{th}$ item = 18.5th item, which falls in the cumulative frequency 21

Median = 108

Mean Deviation = $\frac{\sum f |x - Me|}{N} = \frac{\sum f |D|}{N} = \frac{1521}{36} = 42.25$

Coefficient of MD = $\frac{MD}{Me} = \frac{42.25}{108} = 0.3912$

Continuous Data Series 


The method of calculating mean deviation in a continuous series
is the same as for a discrete series. In a continuous series, find the
mid-points of the various classes and take the deviations of these
points from the average selected.

$MD = \frac{\sum f |x - A|}{N} = \frac{\sum f |D|}{N}$

Where N = Number of observations (Σf)
f = frequency of each class
x = mid-points of the classes
A = average selected (the median Me or the arithmetic mean)

Coefficient of Mean Deviation

The Coefficient of Mean Deviation can be calculated using the
following formula.

Coefficient of MD = MD / A

Example:

Find out the mean deviation from the given data

Age in years | 0-10 | 10-20 | 20-30 | 30-40 | 40-50 | 50-60 | 60-70 | 70-80
No. of persons | 40 | 50 | 64 | 80 | 82 | 70 | 20 | 16

Solution: Taking the arithmetic mean as the average,

Class | Mid-point $x_i$ | Frequency $f_i$ | $f_i x_i$ | $|x_i - \bar{x}|$ | $f_i |x_i - \bar{x}|$
0-10 | 5 | 40 | 200 | 31.47 | 1258.80
10-20 | 15 | 50 | 750 | 21.47 | 1073.50
20-30 | 25 | 64 | 1600 | 11.47 | 734.08
30-40 | 35 | 80 | 2800 | 1.47 | 117.60
40-50 | 45 | 82 | 3690 | 8.53 | 699.46
50-60 | 55 | 70 | 3850 | 18.53 | 1297.10
60-70 | 65 | 20 | 1300 | 28.53 | 570.60
70-80 | 75 | 16 | 1200 | 38.53 | 616.48
Total | | N = 422 | $\sum f_i x_i$ = 15390 | | 6367.62

Mean = $\frac{\sum f_i x_i}{N} = \frac{15390}{422} = 36.47$

Mean Deviation = $\frac{\sum f |x - \bar{x}|}{N} = \frac{\sum f |D|}{N} = \frac{6367.62}{422} = 15.0891$

Coefficient of MD = $\frac{MD}{\bar{x}} = \frac{15.0891}{36.47} = 0.4137$


Merits of Mean Deviation: 


• It is simple to understand and easy to compute.

• It is based on each and every item of the data.

• MD is less affected by the values of extreme items than the
standard deviation.

Demerits of Mean Deviation:

• The greatest drawback of this method is that algebraic signs are
ignored while taking the deviations of the items.

• It is not capable of further algebraic treatment.

• It is much less popular as compared to standard deviation.


2.11 STANDARD DEVIATION 


The concept of Standard Deviation was introduced by Karl 
Pearson in 1893. It is by far the most important and widely used measure 
of dispersion. Its significance lies in the fact that it is free from those 
defects which afflicted earlier methods and satisfies most of the 
properties of a good measure of dispersion. Standard Deviation is also 
known as root-mean square deviation as it is the square root of means of 
the squared deviations from the arithmetic mean. 


The standard deviation is defined as the positive square root of 
the mean of the square deviations taken from the arithmetic mean of the 
data 


Ungrouped data 


If $x_1, x_2, x_3, \ldots, x_n$ are the ungrouped data, there are two methods
of calculating the standard deviation of an individual series:

• Actual mean method
• Assumed mean method

Actual Mean Method:

Standard deviation $\sigma = \sqrt{\frac{\sum (x - \bar{x})^2}{n}}$


Example: 


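Calculate the standard deviation from the following data: 28, 44,
18, 30, 40, 34, 24, 22.

Solution:

Deviations from actual mean

Values (X) | $X - \bar{X}$ | $(X - \bar{X})^2$
28 | -2 | 4
44 | 14 | 196
18 | -12 | 144
30 | 0 | 0
40 | 10 | 100
34 | 4 | 16
24 | -6 | 36
22 | -8 | 64
Total 240 | | 560

$\bar{X} = \frac{240}{8} = 30$

$\sigma = \sqrt{\frac{\sum (X - \bar{X})^2}{n}} = \sqrt{\frac{560}{8}} = \sqrt{70} = 8.3666$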
Assumed Mean Method 


This method is used when the arithmetic mean is a fractional value.
Taking deviations from a fractional value would be a very difficult and
tedious task. To save time and labour a short-cut method is used:
deviations are taken from an assumed mean A.

Standard Deviation $\sigma = \sqrt{\frac{\sum d^2}{n} - \left( \frac{\sum d}{n} \right)^2}$, where d = X - A

Example:

The marks obtained by ten college students in statistics are given
below. Using these data, calculate the standard deviation.

Student No. | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10
Marks | 53 | 58 | 75 | 67 | 32 | 70 | 35 | 68 | 88 | 69

Solution: Deviations from assumed mean

Student No. | Marks (X) | d = X − A (A = 67) | d²
1 | 53 | -14 | 196
2 | 58 | -9 | 81
3 | 75 | 8 | 64
4 | 67 | 0 | 0
5 | 32 | -35 | 1225
6 | 70 | 3 | 9
7 | 35 | -32 | 1024
8 | 68 | 1 | 1
9 | 88 | 21 | 441
10 | 69 | 2 | 4
n = 10 | | Σd = -55 | Σd² = 3045

$\sigma = \sqrt{\frac{3045}{10} - \left(\frac{-55}{10}\right)^2} = \sqrt{304.5 - 30.25} = \sqrt{274.25}$

$= 16.5605$
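
The assumed mean computation can be sketched in Python as follows. This is an illustrative sketch rather than part of the original text: the function name `sd_assumed_mean` is hypothetical, and the marks are those used in the solution table. The second call uses a different assumed mean to show that the choice of A does not change the result.

```python
import math

def sd_assumed_mean(values, assumed_mean):
    """Standard deviation using deviations d = X - A from an assumed mean A."""
    n = len(values)
    deviations = [x - assumed_mean for x in values]
    sum_d = sum(deviations)
    sum_d2 = sum(d * d for d in deviations)
    return math.sqrt(sum_d2 / n - (sum_d / n) ** 2)

# Marks as they appear in the solution table
marks = [53, 58, 75, 67, 32, 70, 35, 68, 88, 69]
sd_67 = sd_assumed_mean(marks, 67)  # A = 67, as in the example
sd_50 = sd_assumed_mean(marks, 50)  # any other A gives the same answer
print(round(sd_67, 4))  # 16.5605
```
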


2.11.1 CALCULATION OF STANDARD DEVIATION 


Discrete series: There are three methods for calculating standard
deviation in discrete series. They are

a) Actual mean method
b) Assumed mean method
c) Step deviation method

Actual mean method

Calculate the mean of the series. Find the deviations of the various
items from the mean, square the deviations, multiply each by its
respective frequency and total the products. The formula for the actual
mean method is

$\sigma = \sqrt{\frac{\sum f (X - \bar{X})^2}{N}}$, where $N = \sum f$

If the actual mean is a fraction, the calculation takes a lot of time
and labour, and as such this method is rarely used in practice.


Assumed mean method 


Here deviations are taken not from the actual mean but from an
assumed mean. This method is also used if the given variable values are
not in equal intervals.

$\sigma = \sqrt{\frac{\sum f d^2}{N} - \left(\frac{\sum f d}{N}\right)^2}$, where $d = X - A$, $N = \sum f$


Example:

Calculate standard deviation from the following data:

x | 20 | 22 | 25 | 31 | 35 | 40 | 42 | 45
f | 5 | 12 | 15 | 20 | 25 | 14 | 10 | 6

Solution:

Deviation from assumed mean

x | f | d = X − A (A = 31) | d² | fd | fd²
20 | 5 | -11 | 121 | -55 | 605
22 | 12 | -9 | 81 | -108 | 972
25 | 15 | -6 | 36 | -90 | 540
31 | 20 | 0 | 0 | 0 | 0
35 | 25 | 4 | 16 | 100 | 400
40 | 14 | 9 | 81 | 126 | 1134
42 | 10 | 11 | 121 | 110 | 1210
45 | 6 | 14 | 196 | 84 | 1176
N = 107 | | | | Σfd = 167 | Σfd² = 6037

$\sigma = \sqrt{\frac{\sum f d^2}{N} - \left(\frac{\sum f d}{N}\right)^2} = \sqrt{\frac{6037}{107} - \left(\frac{167}{107}\right)^2} = \sqrt{56.42 - 2.44} = \sqrt{53.98} = 7.35$
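Step-deviation method:

If the variable values are in equal intervals, then we adopt this
method.

Standard Deviation $\sigma = \sqrt{\frac{\sum f d^2}{N} - \left(\frac{\sum f d}{N}\right)^2} \times C$, where $d = \frac{X - A}{C}$ and $C$ is the common interval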


Example:

The frequency distribution of marks in mathematics is given in the
table below. Calculate the standard deviation.

Marks | 30 | 40 | 50 | 60 | 70 | 80 | 90
No. of students | 8 | 12 | 20 | 10 | 7 | 3 | 2

Solution:

Marks x | f | d = (x − 50)/10 | fd | fd²
30 | 8 | -2 | -16 | 32
40 | 12 | -1 | -12 | 12
50 | 20 | 0 | 0 | 0
60 | 10 | 1 | 10 | 10
70 | 7 | 2 | 14 | 28
80 | 3 | 3 | 9 | 27
90 | 2 | 4 | 8 | 32
N = 62 | | | Σfd = 13 | Σfd² = 141

Standard Deviation $\sigma = \sqrt{\frac{\sum f d^2}{N} - \left(\frac{\sum f d}{N}\right)^2} \times C$

$= \sqrt{\frac{141}{62} - \left(\frac{13}{62}\right)^2} \times 10 = 1.4934 \times 10 = 14.934$
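
The step-deviation calculation for a frequency distribution can be sketched in Python as below. This is a minimal illustration, not part of the original text; the function name `sd_step_deviation` and its parameter names are hypothetical.

```python
import math

def sd_step_deviation(values, freqs, assumed_mean, step):
    """Standard deviation of a frequency distribution by step deviations."""
    n = sum(freqs)
    # Step deviation d = (x - A) / C for each value
    d = [(x - assumed_mean) / step for x in values]
    sum_fd = sum(f * di for f, di in zip(freqs, d))
    sum_fd2 = sum(f * di * di for f, di in zip(freqs, d))
    return math.sqrt(sum_fd2 / n - (sum_fd / n) ** 2) * step

marks = [30, 40, 50, 60, 70, 80, 90]
students = [8, 12, 20, 10, 7, 3, 2]
sd = sd_step_deviation(marks, students, assumed_mean=50, step=10)
print(round(sd, 3))  # 14.934
```

Setting `assumed_mean=0` and `step=1` reduces the same function to a direct calculation on the raw values, which gives the same standard deviation.
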


Combined Mean and Combined Standard Deviation 


The combined arithmetic mean can be computed if we know the
mean and the number of items in each group of the data.

Let $\bar{X}_1$, $\bar{X}_2$, $\sigma_1$, $\sigma_2$ be the means and standard deviations of two data
sets having $n_1$ and $n_2$ elements respectively.

Combined mean $\bar{X}_{12} = \frac{n_1 \bar{X}_1 + n_2 \bar{X}_2}{n_1 + n_2}$ (if two data sets)

$\bar{X}_{123} = \frac{n_1 \bar{X}_1 + n_2 \bar{X}_2 + n_3 \bar{X}_3}{n_1 + n_2 + n_3}$ (if three data sets)


Example:

Particulars regarding the income of two companies are given below:

Particulars | Company A | Company B
No. of employees | 600 | 500
Average income | 1500 | 1750
Standard deviation of income | 10 | 9

Compute the combined mean and combined standard deviation.

Solution:

Given $n_1 = 600$; $\bar{X}_1 = 1500$; $\sigma_1 = 10$
$n_2 = 500$; $\bar{X}_2 = 1750$; $\sigma_2 = 9$

$\bar{X}_{12} = \frac{600 \times 1500 + 500 \times 1750}{600 + 500} = \frac{900000 + 875000}{1100} = 1613.6363$

Combined Standard Deviation:

$\sigma_{12} = \sqrt{\frac{n_1(\sigma_1^2 + d_1^2) + n_2(\sigma_2^2 + d_2^2)}{n_1 + n_2}}$

$d_1 = \bar{X}_{12} - \bar{X}_1 = 1613.6363 - 1500 = 113.6363$
$d_2 = \bar{X}_{12} - \bar{X}_2 = 1613.6363 - 1750 = -136.3637$

$\sigma_{12} = \sqrt{\frac{600(100 + 12913.209) + 500(81 + 18595.0587)}{600 + 500}}$

$= 124.8488$
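
The combined mean and combined standard deviation can be reproduced with a few lines of Python. This is an illustrative sketch, not part of the original text; the function name `combined_mean_sd` and the tuple layout `(n, mean, sd)` are assumptions made for the example. It works for any number of groups.

```python
import math

def combined_mean_sd(groups):
    """groups: list of (n, mean, sd) tuples; returns (combined mean, combined sd)."""
    total_n = sum(n for n, _, _ in groups)
    mean = sum(n * m for n, m, _ in groups) / total_n
    # Each group contributes n * (sd^2 + d^2), where d = combined mean - group mean
    pooled = sum(n * (sd ** 2 + (mean - m) ** 2) for n, m, sd in groups)
    return mean, math.sqrt(pooled / total_n)

mean_12, sd_12 = combined_mean_sd([(600, 1500, 10), (500, 1750, 9)])
print(round(mean_12, 4), round(sd_12, 2))  # 1613.6364 124.85
```
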
Merits of Standard Deviation: 


Among all measures of dispersion Standard Deviation is 
considered superior because it possesses almost all the requisite 
characteristics of a good measure of dispersion. It has the following 
merits: 


• It is rigidly defined.

• It is based on all the observations of the series and hence it is
representative.

• It is amenable to further algebraic treatment.

• It is least affected by fluctuations of sampling.


Demerits:


• It is more affected by extreme items.

• It cannot be exactly calculated for a distribution with open-ended
classes.

• It is relatively difficult to calculate and understand.


2.12 COEFFICIENT OF VARIATION 


The coefficient of variation (CV) is a statistical measure of the dispersion 
of data points in a data series around the mean. The coefficient of 
variation represents the ratio of the standard deviation to the mean, and it 
is a useful statistic for comparing the degree of variation from one data 
series to another, even if the means are drastically different from one 
another. 


Coefficient of Variation = (Standard Deviation / Mean) × 100

$CV = \left(\frac{\sigma}{\bar{X}}\right) \times 100$

The coefficient of variation (CV) is a measure of relative
variability. It is the ratio of the standard deviation to the mean (average).
For example, the expression "The standard deviation is 15% of the mean"
is a CV.

The CV is particularly useful when you want to compare results
from two different surveys or tests that have different measures or values,
for example two tests that have different scoring mechanisms. If sample A
has a CV of 12% and sample B has a CV of 25%, you would say that
sample B has more variation relative to its mean.


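Example:

The prices of a car over five years in two cities are given below:

Price in city A | Price in city B
20,00,000 | 10,00,000
22,00,000 | 20,00,000
19,00,000 | 18,00,000
23,00,000 | 12,00,000
16,00,000 | 15,00,000

Which city has more stable prices?

Solution:

City A | | | City B | |
Price X (in lakhs) | Deviation dx (from X̄ = 20) | dx² | Price Y (in lakhs) | Deviation dy (from Ȳ = 15) | dy²
20 | 0 | 0 | 10 | -5 | 25
22 | 2 | 4 | 20 | 5 | 25
19 | -1 | 1 | 18 | 3 | 9
23 | 3 | 9 | 12 | -3 | 9
16 | -4 | 16 | 15 | 0 | 0
ΣX = 100 | Σdx = 0 | Σdx² = 30 | ΣY = 75 | Σdy = 0 | Σdy² = 68

City A: $\bar{X} = \frac{\sum X}{n} = \frac{100}{5} = 20$

$\sigma_x = \sqrt{\frac{\sum dx^2}{n}} = \sqrt{\frac{30}{5}} = \sqrt{6} = 2.45$

$C.V.(X) = \left(\frac{\sigma_x}{\bar{X}}\right) \times 100 = \frac{2.45}{20} \times 100 = 12.25\%$

City B: $\bar{Y} = \frac{\sum Y}{n} = \frac{75}{5} = 15$

$\sigma_y = \sqrt{\frac{\sum dy^2}{n}} = \sqrt{\frac{68}{5}} = \sqrt{13.6} = 3.69$

$C.V.(Y) = \left(\frac{\sigma_y}{\bar{Y}}\right) \times 100 = \frac{3.69}{15} \times 100 = 24.6\%$
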


City A had more stable prices than City B, because the coefficient of 
variation is less in City A. 
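
The comparison of the two cities can be sketched in Python using the standard `statistics` module. This is a minimal illustration, not part of the original text; the function name `coefficient_of_variation` is hypothetical, and the population standard deviation (dividing by n) is assumed, as in the worked example.

```python
import statistics

def coefficient_of_variation(values):
    """CV in percent: population standard deviation divided by the mean."""
    return statistics.pstdev(values) / statistics.mean(values) * 100

city_a = [20, 22, 19, 23, 16]  # prices in lakhs
city_b = [10, 20, 18, 12, 15]
cv_a = coefficient_of_variation(city_a)
cv_b = coefficient_of_variation(city_b)
print(round(cv_a, 2), round(cv_b, 2))  # 12.25 24.59
```

The value for City B differs slightly from the hand calculation only because the hand calculation rounds the standard deviation to 3.69 before dividing. The conclusion is unchanged: the smaller CV belongs to City A.
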


CHECK YOUR PROGRESS - 2 
5. What is the coefficient of mean deviation? 
6. What does standard deviation mean? 


7. What is a measure of relative variability? 


2.13 SUMMARY 


• Measures of central tendency fail to reveal the degree of spread
or the extent of variability of the individual items of the
distribution. Dispersion is the extent to which a distribution
is stretched or squeezed.

• The range is the most common and easily understandable measure
of dispersion. It is the difference between the largest and smallest
observations in the data set.

Coefficient of range = (L − S) / (L + S)

• The quartiles divide a data set into quarters. The first quartile
(Q1) is the middle number between the smallest number and the
median of the data. The third quartile (Q3) is the middle number
between the median and the largest number. Quartile deviation is
half of the difference between the first and third quartiles. Hence
it is called the Semi Inter-Quartile Range.

• The average (mean) deviation is defined as the sum of the absolute
deviations from an average divided by the number of items in the
distribution. The average can be the mean, median or mode.

• The standard deviation is defined as the positive square root of
the mean of the squared deviations taken from the arithmetic mean
of the data.

• The coefficient of variation (CV) is a statistical measure of the
dispersion of data points in a data series around the mean.


2.14 KEY WORDS 


Measures of dispersion, range, quartile deviation, mean deviation,
standard deviation, and coefficient of variation.


2.15 ANSWERS TO CHECK YOUR PROGRESS 


1. Dispersion is the extent to which a distribution can be stretched
or squeezed.

2. For a grouped frequency distribution of values in the data set, the
range is the difference between the upper class limit of the last
class interval and the lower class limit of the first class interval.

3. Coefficient of Q.D = (Q3 − Q1) / (Q3 + Q1)

4. All the drawbacks of range are overcome by quartile deviation.

5. Coefficient of MD = MD / Median (or MD / Mean, according to the
average from which the deviations are taken)

6. The standard deviation is defined as the positive square root of
the mean of the squared deviations taken from the arithmetic mean
of the data.

7. Coefficient of variation (CV): $CV = \left(\frac{\sigma}{\bar{X}}\right) \times 100$


2.16 QUESTIONS AND EXERCISE 


SHORT ANSWER QUESTIONS 


• What is dispersion? How is it advantageous over the measures of
central tendency?

• Write short notes on range.

• What is the coefficient of variation? Explain.

• Write about mean deviation.


LONG ANSWER QUESTIONS


• Calculate mean deviation under the assumed mean method

Marks | 30 | 40 | 50 | 60 | 70 | 80 | 90
No. of students | 16 | 24 | 40 | 20 | 14 | 6 | 4

• Give a detailed account of standard deviation.

• Calculate the quartile deviation and its coefficient for the corn
production (in kg) of 25 acres, given as: 1100, 1340, 1370,
1050, 1780, 1200, 2440, 1390, 1480, 1780, 1783, 1542, 1970,
1680, 1775, 1320, 1680, 1770, 1780 and 1889.
UNIT III - PROBABILITY


Structure
3.0 Introduction
3.1 Objectives
3.2 Important Terms
3.3 Types of Probability
3.4 Basic Relationships of Probability
3.5 Addition Theorem of Probability
3.6 Multiplication Theorem of Probability
3.7 Conditional Probability
    3.7.1 Combined Use of Addition and Multiplication Theorem
3.8 Bayes' Theorem and its Applications
3.9 Summary
3.10 Key Words
3.11 Answers to Check Your Progress
3.12 Questions and Exercise
3.13 Further Readings


3.0 INTRODUCTION


In our day-to-day life, "probability" or "chance" is a very
commonly used term. Sometimes we say "Probably it may rain
tomorrow", "Probably Mr. X may come to take his class today",
"Probably you are right". All these terms, possibility and probability,
convey the same meaning. But in statistics probability has a special
connotation, unlike in the layman's view.

The theory of probability was developed in the 17th century. It
has its origin in games: tossing coins, throwing dice, drawing a
card from a pack. In 1654 Antoine Gombaud took the initiative and
an interest in this area.

After him many authors in statistics tried to remodel the idea
given by the former. Probability has become one of the basic tools
of statistics. Sometimes statistical analysis becomes paralyzed without
the theorems of probability. "Probability of a given event is defined as the
expected frequency of occurrence of the event among events of a like
sort." (Garrett)

The probability theory provides a means of getting an idea of the
likelihood of occurrence of different events resulting from a random
experiment in terms of quantitative measures ranging between zero and
one. The probability is zero for an impossible event and one for an event
which is certain to occur.


3.1 OBJECTIVES 


The students will be able to understand
• The important terms in probability
• The concepts of conditional probability, the addition theorem and
the multiplication theorem
• Bayes' theorem and its applications


3.2 IMPORTANT TERMS 


1. Probability or Chance: Probability or chance is a common term
used in day-to-day life. For example, we generally say, 'it may
rain today'. This statement has a certain uncertainty. Probability is
a quantitative measure of the chance of occurrence of a particular
event.

2. Experiment: An experiment is an operation which can produce
well-defined outcomes.

3. Random Experiment: If all the possible outcomes of an
experiment are known but the exact outcome cannot be predicted in
advance, that experiment is called a random experiment.
Example: Tossing of a fair coin. When we toss a coin, the
outcome will be either Head (H) or Tail (T).

4. Trial: Any particular performance of a random experiment is
called a trial.
Example: Tossing 4 coins, rolling a die, picking a ball from a bag
containing 10 balls of which 4 are red and 6 are blue.


5. Event: Any subset of a sample space is an event. Events are
generally denoted by capital letters A, B, C, D etc.
Example:
i. When a coin is tossed, the outcome of getting a head or a
tail is an event.
Types of Events:

• Simple Events: In the case of simple events, we take
the probability of occurrence of single events.
Example: Probability of getting a Head (H) when a coin
is tossed.

• Compound Events: In the case of compound events,
we take the probability of joint occurrence of two or
more events.
Example: When two coins are tossed, probability of
getting a Head (H) in the first toss and getting a Tail
(T) in the second toss.

6. Sample Space: Sample space is the set of all possible outcomes
of an experiment. It is denoted by S.
Example: When a coin is tossed, S = {H, T} where H =
Head and T = Tail.

7. Mutually Exclusive Events: Two or more events are
said to be mutually exclusive if the occurrence of one of the
events excludes the occurrence of the others.
Example: When a coin is tossed, we get either Head or Tail.
Head and Tail cannot come up simultaneously. Hence the occurrence
of Head and the occurrence of Tail are mutually exclusive events.

8. Equally Likely Events: Events are said to be equally likely if
there is no preference for a particular event over the other.
Example: When a coin is tossed, Head (H) or Tail (T) is equally
likely to occur.

9. Independent Events: Events are said to be independent if the
occurrence or non-occurrence of one event does not influence the
occurrence or non-occurrence of the other.
Example:
i. When a coin is tossed twice, the event of getting Tail (T) in
the first toss and the event of getting Tail (T) in the second
toss are independent events. This is because the
occurrence of Tail (T) in any toss does not
influence the occurrence of Tail (T) in the other
toss.

10. Exhaustive Events: The exhaustive events are the total number
of all possible outcomes of an experiment.
Example: When a coin is tossed, we get either Head or Tail.
Hence there are 2 exhaustive events.

11. Favorable Events: The outcomes which entail the
happening of an event in a trial are called favorable events.
Example: If two dice are thrown, the number of favorable events
of getting a sum of 5 is four, i.e., (1, 4), (2, 3), (3, 2) and (4, 1).
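
The terms sample space, exhaustive events and favorable events can be made concrete by listing the outcomes of the two-dice example. The following Python sketch is an illustration added here, not part of the original text; it assumes that all 36 ordered outcomes are equally likely, and the variable names are arbitrary.

```python
from fractions import Fraction
from itertools import product

# Sample space for throwing two dice: all ordered pairs (first, second)
sample_space = list(product(range(1, 7), repeat=2))

# Outcomes favorable to the event "the sum is 5"
favourable = [pair for pair in sample_space if sum(pair) == 5]

# Ratio of favorable outcomes to exhaustive outcomes
probability = Fraction(len(favourable), len(sample_space))
print(favourable)   # [(1, 4), (2, 3), (3, 2), (4, 1)]
print(probability)  # 1/9
```
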


3.3 TYPES OF PROBABILITY 


1. Classical Approach (A Priori Probability):

According to this approach, the probability is the ratio of
favorable events to the total number of equally likely events. In
tossing a coin, the probability of the coin coming down is 1, of
the head coming up is 1/2 and of the tail coming up is 1/2.

The probability of one event is denoted by 'p' (success) and of the
other event by 'q' (failure), as there is no third event.

$p = \frac{\text{Number of favourable cases}}{\text{Total number of equally likely cases}}$

If an event can occur in 'a' ways and fail to occur in 'b' ways, and
these are equally likely to occur, then the probability of the event
occurring, a/(a + b), is denoted by p. Such probabilities are known as
unitary or theoretical or mathematical probability.
p is the probability of the event happening and q is the probability
of its not happening.

$p = \frac{a}{a+b}$ and $q = \frac{b}{a+b}$

Hence $p + q = \frac{a+b}{a+b}$

Therefore $p + q = 1$

Probabilities can be expressed either as ratio, fraction or
percentage, such as 1/2 or 0.5 or 50%. Example: Tossing of a coin.


Limitations:

o This definition is confined to the problems of games of
chance only and cannot explain problems other than
games of chance.

o This method cannot be applied when the outcomes of a
random experiment are not equally likely.

o The classical definition is applicable only when the events
are mutually exclusive.

2. Relative Frequency Theory of Probability:

In this approach, the probability of happening of an event is
determined on the basis of past experience or on the basis of the
relative frequency of success in the past.

Example: If a machine produced 100 articles in the past and 2 articles
were found to be defective, then the probability of a
defective article is 2/100 or 2%.

The relative frequency obtained on the basis of past experience
can be shown to come very close to the classical probability.

Limitations:
o The experimental conditions may not remain essentially
homogeneous and identical in a large number of
repetitions of the experiment.
o The relative frequency m/n may not attain a unique value,
however large n may be.
o The probability p(A) so defined can never be obtained in practice.
We can only attempt a close estimate of p(A) by making
n sufficiently large.



3. Subjective Approach:

The subjective approach is also known as the subjective
theory of probability. The probability of an event is considered as
a measure of one's confidence in the occurrence of that particular
event.
This theory is commonly used in business decision making. The
decision reflects the personality of the decision maker. Persons
may arrive at different probability assignments because of
differences in values, experience etc. The decision under
this theory is taken on the basis of the available data plus the
effects of other factors, many of which may be subjective in
nature.

Example: A student would top the B. Com Exam this year.
A subjectivist would assign a weight between zero and one to this
event according to his belief in its possible occurrence.


4. Axiomatic Approach:

The probability calculations are based on axioms. The
axiomatic probability includes the concepts of both the classical and
the empirical definitions of probability.

The approach assumes finite sample spaces and is based on the
following three axioms:

i) The probability of an event ranges from 0 to 1. If the event
cannot take place its probability shall be '0' and if it is
bound to occur its probability is '1'.

ii) The probability of the entire sample space is 1, i.e. p(S) = 1.

iii) If A and B are mutually exclusive events then the
probability of occurrence of either A or B is
$p(A \cup B) = p(A) + p(B)$

In addition, if A and B are independent events, the probability of
their happening together, i.e. of A intersection B, is
$p(A \cap B) = p(A) \cdot p(B)$


3.4 BASIC RELATIONSHIPS OF PROBABILITY 


There are some basic probability relationships that can be used to
compute the probability of an event without knowledge of all the
sample point probabilities.

Complement of an Event: The complement of any event A is the event
(not A), i.e., the event that A does not occur. The event A and its
complement (not A) are mutually exclusive and exhaustive. It is
denoted by $A'$, $A^c$ or $\bar{A}$.

Union of Two Events: The union of events A and B is the event
containing all sample points that are in A or B or both. It is
denoted by $A \cup B$.

Intersection of Two Events: The intersection of events A and B is the
set of all sample points that are in both A and B. It is denoted by
$A \cap B$.

Mutually Exclusive Events: Two sets are mutually exclusive (also
called disjoint) if they do not have any elements in common; they
need not together comprise the universal set.


3.5 ADDITION THEOREM OF PROBABILITY 


So far we have studied the probability of an event in a random experiment
as well as the axiomatic approach formulated by the Russian mathematician
A.N. Kolmogorov, who treated probability as a function of the outcomes of
an experiment. By now you know that the probability P(A) of an event A
associated with a discrete sample space is the sum of the probabilities
assigned to the sample points in A, as discussed in the axiomatic approach
of probability. Here we will learn the Addition Theorem of Probability,
which gives the probability of occurrence of at least one of two events
under two conditions: when the events are mutually exclusive and when
they are not mutually exclusive.
1. Addition Theorem For Mutually Exclusive Events

Statement: If A and B are two mutually exclusive events, then the
probability of occurrence of either A or B is the sum of the individual
probabilities of A and B. Symbolically

$P(A \cup B) = P(A \text{ or } B) = P(A) + P(B)$

Proof: Let N be the total number of exhaustive and equally likely cases of
an experiment. Let $m_1$ and $m_2$ be the number of cases favourable to the
happening of events A and B respectively. Then

$P(A) = \frac{n(A)}{n(S)} = \frac{m_1}{N}$ and $P(B) = \frac{n(B)}{n(S)} = \frac{m_2}{N}$

Since the events A and B are mutually exclusive, the total number of
cases favorable to either A or B is $n(A \cup B) = m_1 + m_2$. Then

$P(A \cup B) = \frac{n(A \cup B)}{n(S)} = \frac{m_1 + m_2}{N} = \frac{m_1}{N} + \frac{m_2}{N} = P(A) + P(B)$


Example 1: A card is drawn at random from a pack of 52 cards. Find the
probability that the drawn card is either a club or the ace of diamonds.

Solution: Let A: Event of drawing a card of club and
B: Event of drawing the ace of diamonds

The probability of drawing a card of club: $P(A) = \frac{13}{52}$

The probability of drawing the ace of diamonds: $P(B) = \frac{1}{52}$

Since the events are mutually exclusive, the probability of the drawn card
being a club or the ace of diamonds is:

$P(A \cup B) = P(A) + P(B) = \frac{13}{52} + \frac{1}{52} = \frac{14}{52} = \frac{7}{26}$


2. Addition Theorem For Non-Mutually Exclusive Events 


The addition theorem discussed above is not applicable when the 
events are not mutually exclusive. For example, if one card is drawn at 
random from a pack of 52 cards then in order to find the probability of 
either a spade or a king card, it cannot be calculated by simply adding the 
probabilities of spade and king card because the events are not mutually 
exclusive as there is one card which is a spade as well as a king. Thus, 
the events are not mutually exclusive; therefore, the addition theorem is 
modified as:

Statement: If A and B are not mutually exclusive events, the probability
of the occurrence of either A or B or both is equal to the probability that
event A occurs, plus the probability that event B occurs, minus the
probability of occurrence of the events common to both A and B. In other
words the probability of occurrence of at least one of them is given by

$P(A \cup B) = P(A \text{ or } B) = P(A) + P(B) - P(A \cap B)$


Proof: Let us suppose that a random experiment results in a sample
space S with N sample points (exhaustive number of cases). Then by
definition

$P(A \cup B) = \frac{n(A \cup B)}{n(S)} = \frac{n(A \cup B)}{N}$

where $n(A \cup B)$ is the number of occurrences (sample points) favorable to
the event $A \cup B$.

[Figure: Venn diagram of events A and B (addition theorem for
non-mutually exclusive events)]

From the above diagram, we get:

$P(A \cup B) = \frac{[n(A) - n(A \cap B)] + n(A \cap B) + [n(B) - n(A \cap B)]}{N}$

$= \frac{n(A) + n(B) - n(A \cap B)}{N}$

$= \frac{n(A)}{N} + \frac{n(B)}{N} - \frac{n(A \cap B)}{N}$

$= P(A) + P(B) - P(A \cap B)$

Example 2

A card is drawn at random from a pack of 52 cards. Find the probability
that the drawn card is either a spade or a king.

Solution: Let A: Event of drawing a card of spade and
B: Event of drawing a king card

The probability of drawing a card of spade: $P(A) = \frac{13}{52}$

The probability of drawing a king card: $P(B) = \frac{4}{52}$

Because one of the kings is also a spade, these events are
not mutually exclusive. The probability of drawing the king of spades is

$P(A \cap B) = \frac{1}{52}$

So, the probability of drawing a spade or a king is:

$P(A \cup B) = P(A) + P(B) - P(A \cap B) = \frac{13}{52} + \frac{4}{52} - \frac{1}{52} = \frac{16}{52} = \frac{4}{13}$
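
Both forms of the addition theorem can be checked by enumerating a pack of cards. The Python sketch below is an illustration added here, not part of the original text; the helper `prob` and the way cards are represented as (rank, suit) pairs are assumptions made for the example.

```python
from fractions import Fraction
from itertools import product

suits = ["club", "diamond", "heart", "spade"]
ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
deck = set(product(ranks, suits))  # 52 equally likely cards

def prob(event):
    """Classical probability: favorable cards divided by all cards."""
    return Fraction(len(event), len(deck))

spades = {card for card in deck if card[1] == "spade"}
kings = {card for card in deck if card[0] == "K"}
clubs = {card for card in deck if card[1] == "club"}
ace_of_diamonds = {("A", "diamond")}

# Not mutually exclusive: subtract the common card (king of spades)
p_spade_or_king = prob(spades) + prob(kings) - prob(spades & kings)

# Mutually exclusive: the probabilities simply add
p_club_or_ace = prob(clubs) + prob(ace_of_diamonds)

print(p_spade_or_king, p_club_or_ace)  # 4/13 7/26
```
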


3.6 MULTIPLICATION THEOREM OF PROBABILITY 


In the previous section we studied the addition theorem of probability for
mutually exclusive events as well as for events which are not mutually
exclusive. In many
situations we want to find the probability of simultaneous occurrence of
two or more events. Sometimes the information is available that an event
A has occurred and one is required to find the probability of occurrence
of another event B utilizing the information about event A. Such a
probability is known as conditional probability. Here we shall discuss the
important concept of conditional probability of an event, which will be
helpful in understanding the multiplication theorem of
probability as well as the independence of events.


1. Multiplication Theorem for Independent Events 


Statement: This theorem states that if two events A and B are
independent then the probability that both of them will occur is equal to
the product of their individual probabilities.

$P(AB) = P(A \cap B) = P(A \text{ and } B) = P(A) \cdot P(B)$

Proof

If an event A can happen in $n_1$ ways out of which $a_1$ are favorable and the
event B can happen in $n_2$ ways out of which $a_2$ are favorable, we can
combine each favorable event in the first with each favorable event in the
second case. Thus, the total number of favorable cases is $a_1 \times a_2$.
Similarly, the total number of possible cases is $n_1 \times n_2$. Then by definition
the probability of happening of both the independent events is

$P(A \cap B) = P(A \text{ and } B) = \frac{a_1 \times a_2}{n_1 \times n_2} = \frac{a_1}{n_1} \times \frac{a_2}{n_2} = P(A) \times P(B)$

as $P(A) = \frac{a_1}{n_1}$ and $P(B) = \frac{a_2}{n_2}$

Similarly we can extend the theorem to three events:

$P(A \cap B \cap C) = P(A) \cdot P(B|A) \cdot P(C|A \cap B)$

which reduces to $P(A) \cdot P(B) \cdot P(C)$ when the three events are independent.


Example 1. From a pack of 52 cards, two cards are drawn at random one
after the other with replacement. What is the probability that both cards
are kings?

Solution:
The probability of drawing a king: $P(A) = \frac{4}{52}$
The probability of drawing a king again after replacement: $P(B) = \frac{4}{52}$

Since the two events are independent, the probability of drawing two
kings is:

$P(A \text{ and } B) = P(A \cap B) = P(A) \times P(B) = \frac{4}{52} \times \frac{4}{52} = \frac{1}{169}$

2. Multiplication Theorem of Probability for Dependent Events

Statement: The probability of simultaneous happening of two events A
and B is given by:

$P(A \cap B) = P(A) \cdot P(B|A); \quad P(A) \neq 0$
$P(B \cap A) = P(B) \cdot P(A|B); \quad P(B) \neq 0$

where $P(B|A)$ is the conditional probability of happening of B under the
condition that A has happened and $P(A|B)$ is the conditional probability
of happening of A under the condition that B has happened.

Proof:

Let A and B be the events associated with the sample space S of a
random experiment with exhaustive number of outcomes (sample points)
N, i.e., n(S) = N. Then by definition

$P(A \cap B) = \frac{n(A \cap B)}{n(S)}$ ... (1)

For the conditional event A|B (i.e., the happening of A under the
condition that B has happened), the favorable outcomes (sample points)
must be out of the sample points of B. In other words, for the event A|B,
the sample space is B and hence

$P(A|B) = \frac{n(A \cap B)}{n(B)}$

Similarly, we have

$P(B|A) = \frac{n(B \cap A)}{n(A)}$

On multiplying and dividing equation (1) by n(A), we get

$P(A \cap B) = \frac{n(A)}{n(S)} \times \frac{n(A \cap B)}{n(A)} = P(A) \cdot P(B|A)$

Also

$P(A \cap B) = \frac{n(B)}{n(S)} \times \frac{n(A \cap B)}{n(B)} = P(B) \cdot P(A|B)$


Example

A bag contains 5 white and 8 red balls. Two successive drawings of 3
balls are made such that (a) the balls are replaced before the second
drawing, and (b) the balls are not replaced before the second drawing. Find
the probability that the first drawing will give 3 white and the second 3
red balls in each case.

Solution:

(a) When the balls are replaced.
Total balls in the bag = 8 + 5 = 13
3 balls can be drawn out of a total of 13 balls in $^{13}C_3$ ways.
3 white balls can be drawn out of 5 white balls in $^{5}C_3$ ways.

Probability of 3 white balls: $P(3W) = \frac{^{5}C_3}{^{13}C_3} = \frac{10}{286}$

Since the balls are replaced after the first draw, there are again
13 balls in the bag. 3 red balls can be drawn out of 8 red balls
in $^{8}C_3$ ways.

Probability of 3 red balls: $P(3R) = \frac{^{8}C_3}{^{13}C_3} = \frac{56}{286}$

Since the events are independent, the required probability is:

$P(3W \text{ and } 3R) = P(3W) \times P(3R) = \frac{^{5}C_3}{^{13}C_3} \times \frac{^{8}C_3}{^{13}C_3} = \frac{10}{286} \times \frac{56}{286} = \frac{140}{20449}$

(b) When the balls are not replaced before the second draw.
Total balls in the bag = 8 + 5 = 13
3 balls can be drawn out of 13 balls in $^{13}C_3$ ways.
3 white balls can be drawn out of 5 white balls in $^{5}C_3$ ways.

The probability of drawing 3 white balls: $P(3W) = \frac{^{5}C_3}{^{13}C_3}$

After the first draw, 10 balls are left. 3 balls can be drawn out of
10 balls in $^{10}C_3$ ways.
3 red balls can be drawn out of 8 red balls in $^{8}C_3$ ways.

Probability of drawing 3 red balls $= \frac{^{8}C_3}{^{10}C_3}$

Since the events are dependent, the required probability is:

$P(3W \text{ and } 3R) = P(3W) \times P(3R|3W) = \frac{^{5}C_3}{^{13}C_3} \times \frac{^{8}C_3}{^{10}C_3} = \frac{5}{143} \times \frac{7}{15} = \frac{7}{429}$
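
The two cases of the bag example can be reproduced with exact fractions. This Python sketch is an illustration added here, not part of the original text; it relies on `math.comb` (available from Python 3.8), and the variable names are arbitrary.

```python
from fractions import Fraction
from math import comb

white, red = 5, 8
total = white + red

# First drawing: 3 white balls out of 13
p_3_white = Fraction(comb(white, 3), comb(total, 3))

# (a) With replacement: the second drawing is again from 13 balls
p_3_red_replaced = Fraction(comb(red, 3), comb(total, 3))
p_a = p_3_white * p_3_red_replaced

# (b) Without replacement: 10 balls remain, all 8 red balls still present
p_3_red_not_replaced = Fraction(comb(red, 3), comb(total - 3, 3))
p_b = p_3_white * p_3_red_not_replaced

print(p_a, p_b)  # 140/20449 7/429
```
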


3.7 CONDITIONAL PROBABILITY 


Sometimes we know that an event A has occurred and are required to find
the probability of the occurrence of another event B. Two events A and B are
said to be dependent when event A can occur only when event B is
known to have occurred (or vice versa). The probability attached to such
an event is called the conditional probability and is denoted by P(A|B), or
in other words, the probability of A given that B has occurred. For example,
we may want to find the probability that a card is the ace of spades when
we know that the card drawn from the pack is black. Let us consider another
problem relating to a dairy plant. There are two lots of full cream packets,
A and B, each containing some defective packets. A coin is tossed; if it
turns up heads, lot A is selected and if it turns up tails, lot B is
selected. In this problem we are interested to know the probability of the
event that a milk packet selected from the lot obtained in this manner is
defective.

Definition: If two events A and B are dependent, then the conditional
probability of B given that event A has occurred is defined as

$P(B|A) = \frac{P(A \cap B)}{P(A)}$, if $P(A) > 0$


Let us consider the experiment of throwing a die once. The sample
space of this experiment is {1, 2, 3, 4, 5, 6}.

Let $E_1$: an even number and $E_2$: a multiple of 3.
Then $E_1 = \{2, 4, 6\}$ and $E_2 = \{3, 6\}$.
Hence, $P(E_1) = 3/6 = 1/2$ and $P(E_2) = 2/6 = 1/3$.

In order to find the probability of occurrence of $E_2$ when it is given that
$E_1$ has occurred, we know that in a single throw of the die 2 or 4 or 6 has
come up. Out of these only 6 is favorable to $E_2$. So the probability of
occurrence of $E_2$ when it is given that $E_1$ has occurred is equal to 1/3. This
probability of $E_2$ when $E_1$ has occurred is written as $P(E_2|E_1)$. Here we
find that $P(E_2|E_1) = P(E_2)$.

Let us consider the event $E_3$: a number greater than 3. Then
$E_3 = \{4, 5, 6\}$ and $P(E_3) = 3/6 = 1/2$.
Out of 2, 4 and 6, two numbers, namely 4 and 6, are favorable to $E_3$.

Therefore, $P(E_3|E_1) = 2/3$.

Events of the type $E_1$ and $E_2$ are called independent events, as the
occurrence or non-occurrence of $E_1$ does not affect the probability of
occurrence or non-occurrence of $E_2$. The events $E_1$ and $E_3$ are
not independent.
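
The conditional probabilities of the die example follow directly from the definition P(B|A) = P(A ∩ B) / P(A). The Python sketch below is an illustration added here, not part of the original text; the helper names `prob` and `conditional` are hypothetical.

```python
from fractions import Fraction

sample_space = {1, 2, 3, 4, 5, 6}
E1 = {x for x in sample_space if x % 2 == 0}  # an even number
E2 = {x for x in sample_space if x % 3 == 0}  # a multiple of 3
E3 = {x for x in sample_space if x > 3}       # a number greater than 3

def prob(event):
    return Fraction(len(event), len(sample_space))

def conditional(b, a):
    """P(B | A) = P(A and B) / P(A); only defined when P(A) > 0."""
    return prob(a & b) / prob(a)

print(conditional(E2, E1), prob(E2))  # 1/3 1/3 (independent)
print(conditional(E3, E1), prob(E3))  # 2/3 1/2 (not independent)
```
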


3.7.1 Combined Use Of Addition And Multiplication Theorem 


In probability both addition and multiplication theorems are used 
simultaneously. The following examples illustrate the combined use of 
addition and multiplication theorems. 


Example 


A bag contains 5 white and 4 black balls. A ball is drawn from this bag
and replaced, and then a second ball is drawn. What is the
probability that the two balls are of different colors?


Solution: There are two possibilities 
i) First ball is white and the second ball drawn is black. 
ii) First ball is black and the second ball drawn is white. 


Since the two draws are independent, by the multiplication theorem we
have


i) Probability of drawing a white ball first and a black ball second

$$= \frac{5}{9} \times \frac{4}{9} = \frac{20}{81}$$


ii) Probability of drawing a black ball first and a white ball second

$$= \frac{4}{9} \times \frac{5}{9} = \frac{20}{81}$$


Since these two events are mutually exclusive, by the addition
theorem


Probability that the two balls are of different colors

$$= \frac{20}{81} + \frac{20}{81} = \frac{40}{81}$$
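
The combined use of the two theorems in this example can be sketched in Python as follows. The variable names are illustrative; the only inputs are the ball counts stated in the example.

```python
from fractions import Fraction

white, black = 5, 4
total = white + black

p_white = Fraction(white, total)
p_black = Fraction(black, total)

# Multiplication theorem: the draws are independent because the ball is replaced
p_white_then_black = p_white * p_black
p_black_then_white = p_black * p_white

# Addition theorem: the two orderings are mutually exclusive
p_different = p_white_then_black + p_black_then_white
print(p_different)  # 40/81
```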


CHECK YOUR PROGRESS 
1. What is sample space? 
2. What is an event? 
3. Write the formula for addition probability theorem 
4. Mention the types of probability 


5. How is Bayes' theorem calculated?


3.8 BAYES’ THEOREM AND ITS APPLICATIONS 


There are many situations where the ultimate outcome of an
experiment depends on what happens in various intermediate
stages. This issue is resolved by Bayes' theorem.

There is a very big difference between P(A | B) and P(B | A).


Suppose that a new test is developed to identify people who are 
liable to suffer from some genetic disease in later life. Of course, no test 
is perfect; there will be some carriers of the defective gene who test 
negative, and some non-carriers who test positive. So, for example, let A
be the event ‘the patient is a carrier’, and B the event ‘the test result is 
positive’. 

The scientists who develop the test are concerned with the probabilities 
that the test result is wrong, that is, with P(B | A’ ) and P(B' | A). 
However, a patient who has taken the test has different concerns. 

If I tested positive, what is the chance that I have the disease?

If I tested negative, how sure can I be that I am not a carrier? In other
words, the patient is concerned with P(A | B) and P(A' | B').


These conditional probabilities are related by Bayes' Theorem:
Let A and B be events with non-zero probability. Then

$$P(A \mid B) = \frac{P(B \mid A) \cdot P(A)}{P(B)}$$


The proof is not hard. We have 


$$P(A \mid B) \cdot P(B) = P(A \cap B) = P(B \mid A) \cdot P(A),$$

using the definition of conditional probability twice. (Note that we need 
both A and B to have non-zero probability here.) Now divide this 
equation by P(B) to get the result. 


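If P(A) ≠ 0, 1 and P(B) ≠ 0, then

$$P(A \mid B) = \frac{P(B \mid A) \cdot P(A)}{P(B \mid A) \cdot P(A) + P(B \mid A') \cdot P(A')}$$

Bayes' Theorem is often stated in this form. It follows from the first
form because B occurs either together with A or together with A', so
that P(B) = P(B | A)·P(A) + P(B | A')·P(A').
Example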


Consider the clinical test described at the start of this section. Suppose 
that 1 in 1000 of the population is a carrier of the disease. Suppose also 
that the probability that a carrier tests negative is 1%, while the 
probability that a non-carrier tests positive is 5%. (A test achieving these
values would be regarded as very successful.) Let A be the event ‘the 
patient is a carrier’, and B the event ‘the test result is positive’. We are 
given that P(A) = 0.001 (so that P(A’ ) = 0.999), and that 
P(B| A) =0.99, P(B|A’)=0.05. 

(a) A patient has just had a positive test result. What is the probability
that the patient is a carrier? The answer is

$$P(A \mid B) = \frac{P(B \mid A) \cdot P(A)}{P(B \mid A) \cdot P(A) + P(B \mid A') \cdot P(A')} = \frac{0.99 \times 0.001}{(0.99 \times 0.001) + (0.05 \times 0.999)} = \frac{0.00099}{0.05094} = 0.0194.$$


(b) A patient has just had a negative test result. What is the
probability that the patient is a carrier? The answer is

$$P(A \mid B') = \frac{P(B' \mid A) \cdot P(A)}{P(B' \mid A) \cdot P(A) + P(B' \mid A') \cdot P(A')} = \frac{0.01 \times 0.001}{(0.01 \times 0.001) + (0.95 \times 0.999)} = \frac{0.00001}{0.94906} = 0.00001.$$
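
The two calculations in this example follow the same pattern, which can be sketched in Python. The function name `posterior` and its parameter names are illustrative assumptions; the numbers are those given in the example.

```python
def posterior(prior, likelihood, likelihood_given_not):
    """P(A | B) by Bayes' theorem, with P(B) expanded over A and A'.

    prior                = P(A)
    likelihood           = P(B | A)
    likelihood_given_not = P(B | A')
    """
    numerator = likelihood * prior
    evidence = numerator + likelihood_given_not * (1 - prior)
    return numerator / evidence

p_carrier = 0.001               # P(A)
p_pos_given_carrier = 0.99      # P(B | A)
p_pos_given_noncarrier = 0.05   # P(B | A')

# (a) carrier given a positive result
p_carrier_given_pos = posterior(p_carrier, p_pos_given_carrier,
                                p_pos_given_noncarrier)

# (b) carrier given a negative result: use P(B' | A) and P(B' | A')
p_carrier_given_neg = posterior(p_carrier, 1 - p_pos_given_carrier,
                                1 - p_pos_given_noncarrier)

print(f"{p_carrier_given_pos:.4f}")
print(f"{p_carrier_given_neg:.5f}")
```

Even with a test this accurate, a positive result leaves the chance of being a carrier below 2%, because carriers are so rare in the population.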
3.9 SUMMARY 


• Bayes' Theorem is often stated in the following form. If P(A) ≠ 0, 1
and P(B) ≠ 0, then

$$P(A \mid B) = \frac{P(B \mid A) \cdot P(A)}{P(B \mid A) \cdot P(A) + P(B \mid A') \cdot P(A')}$$

• Conditional Probability: Two events A and B are said to be
dependent when event A can occur only when event B is known
to have occurred (or vice versa).

• Multiplication Probability: The probability of simultaneous
occurrence of two or more events.

• Addition Probability: If A and B are not mutually exclusive
events, the probability of the occurrence of either A or B or both
is equal to the probability that event A occurs, plus the probability
that event B occurs, minus the probability of occurrence of the
events common to both A and B.

• Types of Probability: Axiomatic Approach, Classical Approach,
Relative Frequency Theory of Probability, Subjective Approach.


3.10 KEY WORDS 


Probability, Sample, Events, Variables, Addition theorem, Multiplication 
theorem, Axiomatic approach, Classical approach, Relative frequency 
theory, Subjective approach, Bayes' theorem


3.11 ANSWER TO CHECK YOUR PROGRESS 


1. Sample Space: Sample Space is the set of all possible
outcomes of an experiment. It is denoted by S.

2. Event: Any subset of a Sample Space is an event. Events
are generally denoted by capital letters A, B, C, D etc.

3. Addition Theorem for Mutually Exclusive Events
P(A ∪ B) = P(A or B) = P(A) + P(B)

Addition Theorem for Non-Mutually Exclusive Events
P(A ∪ B) = P(A or B) = P(A) + P(B) − P(A ∩ B)

4. Types of Probability: Axiomatic Approach, Classical
Approach, Relative Frequency Theory of
Probability, Subjective Approach

5. Bayes' Theorem is often stated in the following form. If P(A) ≠ 0, 1
and P(B) ≠ 0, then

$$P(A \mid B) = \frac{P(B \mid A) \cdot P(A)}{P(B \mid A) \cdot P(A) + P(B \mid A') \cdot P(A')}$$


3.12 QUESTIONS AND EXERCISE 


SHORT ANSWER QUESTION: 
1. Define probability.
2. What is a sample space?
3. Define random variable.
4. State Bayes' theorem.
5. Explain mutually exclusive events.


LONG ANSWER QUESTIONS:


1. Define probability and bring out the importance of probability.
2. Distinguish between independent and dependent events.
3. Explain briefly Bayes' theorem.
4. If 20% of the bottles produced by a machine are defective,
determine the probability that out of 4 bottles (i) 0, (ii) 1, (iii)
at most 2 bottles will be defective.


3.13 FURTHER READINGS 


1. Statistics (Theory & Practice) by Dr. B.N. Gupta. Sahitya Bhawan
Publishers and Distributors (P) Ltd., Agra.

2. Statistics for Management by G.C. Beri. Tata McGraw Hill
Publishing Company Ltd., New Delhi.

3. Business Statistics by Amir D. Aczel and J. Sounderpandian. Tata
McGraw Hill Publishing Company Ltd., New Delhi.

4. Statistics for Business and Economics by R.P. Hooda. MacMillan
India Ltd., New Delhi.

5. Business Statistics by S.P. Gupta and M.P. Gupta. Sultan Chand
and Sons, New Delhi.

6. Statistical Method by S.P. Gupta. Sultan Chand and Sons, New
Delhi.


UNIT IV - PROBABILITY DISTRIBUTION


Structure 
4.0 Introduction
4.1 Objectives
4.2 Random Variable
4.3 Types of Random Variable
4.4 Binomial Distribution
4.5 Poisson Distribution
4.6 Normal Distribution
4.7 Summary
4.8 Key Words
4.9 Answer to Check Your Progress
4.10 Questions and Exercise
4.11 Further Reading


4.0 INTRODUCTION 


A probability distribution is a table or an equation that links each
outcome of a statistical experiment with its probability of occurrence. It
describes the range of possible values that a random variable can attain
and the probability that the value of the random variable lies within any
subset of that range. For example, if X is a random variable, let P(X)
denote the probability that X takes a given value. It must be the case
that 0 <= P(X) <= 1 for each value of X, and Σ P(X) = 1 (the sum of all
the probabilities is 1).


4.1 OBJECTIVES 


The students will be able to understand
• The random variable and its types in probability distribution
• The concept of Binomial Distribution, Poisson Distribution and
Normal Distribution
• The concept of Mean and Standard deviation of Binomial and Poisson
Distribution


4.2 RANDOM VARIABLE


A random variable is defined as a real number X connected with
the outcome of a random experiment E. For example, if E consists of
three tosses of a coin, we may consider the random variable X which
denotes the number of heads (0, 1, 2 or 3).

Outcome    | HHH | HHT | HTH | THH | HTT | THT | TTH | TTT
Value of X |  3  |  2  |  2  |  2  |  1  |  1  |  1  |  0


Thus, to every outcome w there corresponds a real number X(w). Since the
points of the sample space S correspond to outcomes, this means that a
real number, which we denote by X(w), is defined for each w ∈ S. Let
us denote the outcomes by w1, w2, ..., w8, i.e. X(w1) = 3, X(w2) = 2,
..., X(w8) = 0. Thus, we define a random variable as a real-valued
function whose domain is the sample space associated with a random
experiment and whose range is a subset of the real line. Generally it is
denoted by capital letters X, Y, Z,
etc.
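
The idea of a random variable as a function on the sample space can be illustrated with a short Python sketch of the three-toss experiment. It assumes a fair coin, so that all eight outcomes are equally likely; the variable names are illustrative.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Sample space: all sequences of three tosses
outcomes = ["".join(t) for t in product("HT", repeat=3)]

# The random variable X maps each outcome w to the number of heads X(w)
X = {w: w.count("H") for w in outcomes}

# Probability distribution of X (assumption: each outcome has probability 1/8)
counts = Counter(X.values())
distribution = {x: Fraction(c, len(outcomes)) for x, c in sorted(counts.items())}

for x, p in distribution.items():
    print(x, p)
```

The dictionary `X` plays the role of the function X(w); grouping outcomes by their value of X gives the probability of each value.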


4.3 TYPES OF RANDOM VARIABLE 


1. Discrete random variable 

If a random variable X assumes only a finite or countable set of
values, it is called a discrete random variable. In other words, a
real-valued function defined on a discrete sample space is called a
discrete random variable. In the case of a discrete random variable we
usually talk of values at a point. Generally it represents counted data,
for example, the number of defective milk packets in a milk plant, the
number of students in a class etc.
2. Continuous random variable

A random variable is said to be continuous if it can assume
an infinite and uncountable set of values. A continuous random variable
is one whose different values cannot be put in one-to-one correspondence
with the set of positive integers. For example, the weight of a baby
elephant can take any value in the interval from 160 kg to 260 kg, say
189 kg or 189.4356 kg; likewise, marks scored by the students in a class
etc. In the case of a continuous random variable we usually consider the
values in a particular interval. Continuous random variables represent
measured data.

Probability Distribution of a Random Variable

The concept of a probability distribution is analogous to that of a
frequency distribution. It depicts how the total probability of one is
distributed among the various values which a random variable can take.


Mean and Variance of a Random variable 

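Let X denote a random variable which assumes the values x1, x2, ..., xn
with corresponding probabilities p1, p2, ..., pn. Then the probability
distribution is as follows:

X        | x1 | x2 | ... | xn
P(X = x) | p1 | p2 | ... | pn

Then

$$\sum_{i=1}^{n} p_i = p_1 + p_2 + \cdots + p_n = 1$$

The mean (μ) of the above probability distribution is defined as:

$$\mu = \frac{p_1 x_1 + p_2 x_2 + \cdots + p_n x_n}{p_1 + p_2 + \cdots + p_n} = \frac{\sum p_i x_i}{\sum p_i} = \sum p_i x_i$$

The variance (σ²) is defined as:

$$\sigma^2 = \sum (x_i - \mu)^2 p_i = \sum (x_i^2 + \mu^2 - 2 x_i \mu)\, p_i = \sum p_i x_i^2 + \mu^2 \sum p_i - 2\mu \sum p_i x_i$$

$$= \sum p_i x_i^2 + \mu^2 (1) - 2\mu(\mu) = \sum p_i x_i^2 - \mu^2 = \sum p_i x_i^2 - \left( \sum p_i x_i \right)^2$$

Mean of a random variable X is also known as expected value and is
denoted by E(X):

$$E(X) = \mu = p_1 x_1 + p_2 x_2 + \cdots + p_n x_n = \sum p_i x_i$$

$$\text{Variance } (\sigma^2) = E(X^2) - (E(X))^2$$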
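
The formulas for the mean and variance of a discrete random variable translate directly into code. The sketch below is illustrative only: the function name `mean_and_variance` is an assumption, and the sample distribution is the number of heads in three tosses of a fair coin from section 4.2.

```python
from fractions import Fraction

def mean_and_variance(dist):
    """dist maps each value x_i to its probability p_i (the p_i must sum to 1)."""
    assert sum(dist.values()) == 1
    mean = sum(p * x for x, p in dist.items())               # E(X)
    second_moment = sum(p * x * x for x, p in dist.items())  # E(X^2)
    return mean, second_moment - mean ** 2                   # Var = E(X^2) - (E(X))^2

# Number of heads in three tosses of a fair coin
heads = {0: Fraction(1, 8), 1: Fraction(3, 8), 2: Fraction(3, 8), 3: Fraction(1, 8)}

mu, var = mean_and_variance(heads)
print(mu, var)  # 3/2 3/4
```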
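Example
A die is tossed twice. Getting a number greater than 4 is
considered a success. Find the variance of the probability distribution of
the number of successes.


Solution:
Here p, the probability of a number greater than 4, is 2/6 = 1/3 and q,
the probability of a number not greater than 4, is 1 − 1/3 = 2/3. Let X
denote the number of successes. Then

$$P(X = 0) = q \times q = \frac{2}{3} \times \frac{2}{3} = \frac{4}{9}$$

$$P(X = 1) = p \times q + q \times p = \frac{1}{3} \times \frac{2}{3} + \frac{2}{3} \times \frac{1}{3} = \frac{4}{9}$$

$$P(X = 2) = p \times p = \frac{1}{3} \times \frac{1}{3} = \frac{1}{9}$$


Thus, we have:

x_i       | 0   | 1   | 2
p_i       | 4/9 | 4/9 | 1/9
p_i x_i   | 0   | 4/9 | 2/9
p_i x_i^2 | 0   | 4/9 | 4/9

Hence, the mean

$$\mu = \sum p_i x_i = \frac{6}{9} = \frac{2}{3}$$

The variance

$$\sigma^2 = \sum p_i x_i^2 - \mu^2 = \frac{8}{9} - \left(\frac{6}{9}\right)^2 = \frac{8}{9} - \frac{36}{81} = \frac{72 - 36}{81} = \frac{36}{81} = \frac{4}{9}$$


4.4 BINOMIAL DISTRIBUTION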


Binomial distribution is a discrete probability distribution. This
distribution was discovered by the Swiss mathematician James Bernoulli
(1654-1705). A Bernoulli trial is an experiment having only two
possible outcomes, i.e. success or failure. In other words, the result of
the trial is dichotomous, e.g. the toss of a coin gives either head or
tail, the sex of a calf can be either male or female, a manufactured milk
product, a piece of engineering equipment or a spare part will be either
defective or non-defective etc. This distribution can be used under the
following conditions:

a) The random experiment is performed repeatedly a finite and fixed
number of times, i.e. n, the number of trials, is finite and fixed.

b) The outcome of a trial results in the dichotomous classification of
events, i.e. each trial must result in one of two mutually exclusive
outcomes, success or failure.

c) The probability of success (or failure) remains the same in each
trial, i.e. in each trial the probability of success, denoted by p,
remains constant; q = 1 − p is then termed the probability of failure
(non-occurrence).

d) Trials are independent, i.e. the outcome of any trial does not affect
the outcomes of the subsequent trials.

Theorem: 

If X denotes the number of successes in n trials satisfying the
above conditions, then X is a random variable which can take the values
0, 1, 2, ..., n, i.e. no success, one success, two successes, ..., or all
n successes. The general expression for the probability of r successes is
given by:

$$P(r) = P(X = r) = {}^{n}C_{r}\, p^{r} q^{n-r} \quad \text{for } r = 0, 1, 2, \ldots, n$$

Proof:

By the theorem of compound probability, the probability that r
trials are successes and the remaining (n − r) are failures in a sequence
of n trials in a specified order, say S, F, S, F, F, ..., S, is given by

$$P(S \cap F \cap S \cap F \cap F \cap \cdots \cap S) = P(S)\,P(F)\,P(S)\,P(F)\,P(F) \cdots P(S)$$

$$= p \cdot q \cdot p \cdot q \cdot q \cdots p$$

$$= (p \times p \times \cdots \ r \text{ times}) \times (q \times q \times \cdots \ (n-r) \text{ times}) = p^{r} q^{n-r}$$


But we are interested in any r trials being successes, and r trials can
be chosen out of n trials in nCr (mutually exclusive) ways. Therefore, by
the theorem of total probability, the chance P(r) of r successes in a
series of n independent trials is given by

$$P(r) = {}^{n}C_{r}\, p^{r} q^{n-r}, \quad 0 \le r \le n$$

r can take only non-negative integer values.

Thus, the chance variate, i.e. the number of successes, can take the
values 0, 1, 2, ..., r, ..., n with corresponding probabilities

$$q^{n},\ {}^{n}C_{1}\, p\, q^{n-1},\ {}^{n}C_{2}\, p^{2} q^{n-2},\ \ldots,\ {}^{n}C_{r}\, p^{r} q^{n-r},\ \ldots,\ p^{n}$$

• The probability distribution of the number of successes so
obtained is called the binomial probability distribution, for the
obvious reason that the probabilities are the various terms of the
binomial expansion of (q + p)^n.


• The sum of probabilities

$$\sum_{r=0}^{n} P(r) = P(0) + P(1) + P(2) + \cdots + P(n)$$

$$= q^{n} + {}^{n}C_{1}\, p\, q^{n-1} + \cdots + {}^{n}C_{r}\, p^{r} q^{n-r} + \cdots + p^{n} = (q + p)^{n} = 1$$


• The expression for P(X = r) is known as the probability mass
function of the Binomial distribution with parameters n and p.
The random variable X following this probability law is called a
binomial variate with parameters n and p, denoted as
X ~ B(n, p). Hence the binomial distribution is completely
determined if n and p are known.


Example:

It is known that 40% of patients affected by tuberculosis die every
year. 6 patients suffering from tuberculosis are admitted to a hospital.
What is the probability that

(i) three patients will die
(ii) at least five patients will die
(iii) all patients will be cured
(iv) no patients will be saved.
Solution
We have p = 0.4, q = 1 − 0.4 = 0.6 and n = 6.
In the binomial distribution we have

$$P(r) = {}^{n}C_{r}\, p^{r} q^{n-r}$$

(i) Prob. [three patients will die]

$$P[r = 3] = P(3) = {}^{6}C_{3}\, (0.4)^{3} (0.6)^{3} = \frac{6!}{3!\,3!}\, (0.4)^{3} (0.6)^{3} = 20\, (0.4)^{3} (0.6)^{3} = 0.2765$$

(ii) Prob. (at least five patients will die)

$$P(5) + P(6) = {}^{6}C_{5}\, (0.4)^{5} (0.6)^{1} + {}^{6}C_{6}\, (0.4)^{6} (0.6)^{0}$$

$$= 6\, (0.4)^{5} (0.6)^{1} + (0.4)^{6}$$

$$= 0.0369 + 0.0041 = 0.0410$$

(iii) Prob. (all patients will be cured) = P (no patient will die)

$$P(0) = {}^{6}C_{0}\, (0.4)^{0} (0.6)^{6} = (0.6)^{6} = 0.0467$$

(iv) Prob. (no patients will be saved) = P (all patients will die)

$$P(6) = {}^{6}C_{6}\, (0.4)^{6} (0.6)^{0} = (0.4)^{6} = 0.0041$$
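
The binomial probabilities in this example can be computed with a few lines of Python. This is a sketch under the example's own assumptions (n = 6, p = 0.4, independent patients); the function name `binomial_pmf` is illustrative, and `math.comb` from the standard library supplies nCr.

```python
from math import comb

def binomial_pmf(r, n, p):
    """P(X = r) = nCr * p^r * q^(n - r), with q = 1 - p."""
    return comb(n, r) * p**r * (1 - p) ** (n - r)

n, p = 6, 0.4   # p is the probability that a patient dies

p_three_die = binomial_pmf(3, n, p)
p_at_least_five = binomial_pmf(5, n, p) + binomial_pmf(6, n, p)
p_none_die = binomial_pmf(0, n, p)         # P(0): every patient is cured
p_at_least_one_dies = 1 - p_none_die       # complement of P(0)
p_all_die = binomial_pmf(6, n, p)          # P(6): no patient is saved

for name, value in [("P(3)", p_three_die), ("P(5)+P(6)", p_at_least_five),
                    ("P(0)", p_none_die), ("1-P(0)", p_at_least_one_dies),
                    ("P(6)", p_all_die)]:
    print(f"{name} = {value:.4f}")
```

Note that P(0) = 0.0467 is the probability that nobody dies, while its complement 1 − P(0) = 0.9533 is the probability that at least one patient dies.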


Examples of Binomial distribution
• The number of heads/tails in a sequence of coin flips
• Vote counts for two different candidates in an election
• The number of male/female employees in a company
• The number of accounts that are in compliance or not in
compliance with an accounting procedure
• The number of successful sales calls
• The number of defective products in a production run


Properties of Binomial Distribution 


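i) Mean of binomial distribution is np.


Proof: First raw moment

$$\mu_1' = E(r) = \sum_{r=0}^{n} r\, {}^{n}C_{r}\, p^{r} q^{n-r} = \sum_{r=1}^{n} r\, \frac{n\,(n-1)!}{r\,(r-1)!\,(n-r)!}\, p^{r} q^{n-r}$$

$$= np \sum_{r=1}^{n} \frac{(n-1)!}{(r-1)!\,(n-r)!}\, p^{r-1} q^{n-r} = np \sum_{r=1}^{n} {}^{n-1}C_{r-1}\, p^{r-1} q^{n-r} = np\,(q + p)^{n-1} = np$$


ii) Variance of binomial distribution is npq.


Proof: Second raw moment

$$\mu_2' = E(r^2) = \sum_{r=0}^{n} r^2\, {}^{n}C_{r}\, p^{r} q^{n-r} = \sum_{r=0}^{n} [r + r(r-1)]\, {}^{n}C_{r}\, p^{r} q^{n-r}$$

$$= \sum_{r=0}^{n} r\, {}^{n}C_{r}\, p^{r} q^{n-r} + \sum_{r=0}^{n} r(r-1)\, {}^{n}C_{r}\, p^{r} q^{n-r} = np + n(n-1)p^2 = np + n^2 p^2 - np^2$$

Variance

$$\mu_2 = \mu_2' - (\mu_1')^2 = np + n^2 p^2 - np^2 - n^2 p^2 = np(1 - p) = npq$$

For the binomial distribution, if the mean and variance are known, we
can determine n and p and hence the whole distribution. Since q < 1, the
variance npq is always less than the mean np.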


iii) The third and fourth central moments μ3 and μ4 can be obtained on
the same lines:

$$\mu_3 = npq\,(q - p)$$

$$\mu_4 = npq\,[1 + 3(n - 2)pq]$$

iv) Pearson's constants β1 and β2 as well as γ1 and γ2 are given by

$$\beta_1 = \frac{\mu_3^2}{\mu_2^3} = \frac{[npq\,(q-p)]^2}{(npq)^3} = \frac{(q-p)^2}{npq}, \qquad \gamma_1 = \sqrt{\beta_1} = \frac{q - p}{\sqrt{npq}} = \frac{1 - 2p}{\sqrt{npq}}$$

$$\beta_2 = \frac{\mu_4}{\mu_2^2} = \frac{npq\,[1 + 3(n-2)pq]}{(npq)^2} = 3 + \frac{1 - 6pq}{npq}, \qquad \gamma_2 = \beta_2 - 3 = \frac{1 - 6pq}{npq}$$


γ1 shows that the binomial distribution is positively skewed if q > p,
i.e. p < 1/2, negatively skewed if q < p, i.e. p > 1/2, and symmetrical
if p = q = 1/2. The binomial distribution is leptokurtic if pq < 1/6 and
platykurtic if pq > 1/6.

v) The mode of the binomial distribution is determined by the value
(n+1)p. If this value is an integer equal to k then the distribution is
bi-modal, the two modal values being X = k and X = k − 1. When this value
is not an integer then the distribution has a unique mode at X = k1, the
integral part of (n+1)p.

vi) Additive property: If X1 is B(n1, p) and X2 is B(n2, p) and they are
independent, then their sum X1 + X2 is also a binomial variate
B(n1 + n2, p).
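
The moment formulas in properties (i) to (iv) can be checked numerically by summing over the probability mass function. The sketch below uses hypothetical parameter values n = 10 and p = 0.3, chosen only for illustration; the function name `binomial_moments` is likewise an assumption.

```python
from math import comb

def binomial_moments(n, p):
    """Return the mean and the 2nd, 3rd and 4th central moments of B(n, p)."""
    q = 1 - p
    pmf = [comb(n, r) * p**r * q ** (n - r) for r in range(n + 1)]
    mean = sum(r * pr for r, pr in enumerate(pmf))

    def central(k):
        # k-th moment about the mean, computed directly from the definition
        return sum((r - mean) ** k * pr for r, pr in enumerate(pmf))

    return mean, central(2), central(3), central(4)

n, p = 10, 0.3   # hypothetical values for illustration
q = 1 - p
mean, mu2, mu3, mu4 = binomial_moments(n, p)

gamma1 = mu3 / mu2**1.5     # skewness; positive here because p < 1/2
gamma2 = mu4 / mu2**2 - 3   # excess kurtosis

print(mean, n * p)
print(mu2, n * p * q)
print(mu3, n * p * q * (q - p))
print(mu4, n * p * q * (1 + 3 * (n - 2) * p * q))
```

Each printed pair shows the value obtained by direct summation next to the closed-form expression; the two agree up to floating-point rounding.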
Example:


If the mean and variance of a Binomial Distribution are
respectively 9 and 6, find the distribution.


Solution: The mean of the Binomial Distribution is np and the variance
is npq.


⇒ np = 9 and npq = 6

$$\therefore\ q = \frac{npq}{np} = \frac{6}{9} = \frac{2}{3}, \qquad p = 1 - q = \frac{1}{3}$$

$$np = 9 \ \Rightarrow\ n \times \frac{1}{3} = 9 \ \Rightarrow\ n = 3 \times 9 = 27$$

Hence, the Binomial Distribution is

$$P(r) = {}^{27}C_{r} \left(\frac{1}{3}\right)^{r} \left(\frac{2}{3}\right)^{27 - r}, \quad r = 0, 1, 2, \ldots, 27$$


4.5 POISSON DISTRIBUTION 


Poisson distribution is a limiting case of the Binomial distribution
under the following conditions:
• n, the number of trials, is indefinitely large, i.e. n → ∞
• p, the constant probability of success for each trial, is
indefinitely small, i.e. p → 0
• np = m (say) is finite. Thus, p = m/n, q = 1 − m/n, where m is a
positive real number.


Under the above three conditions the probability mass function of the
binomial distribution tends to the probability mass function of the
Poisson distribution, whose definition and derivation are given below:

A random variable X is said to follow a Poisson distribution if it
assumes only non-negative integer values and its probability mass
function is given by

$$p(r, m) = P(X = r) = \frac{e^{-m}\, m^{r}}{r!}, \quad r = 0, 1, 2, \ldots, \infty$$

where m is known as the parameter of the distribution and

e = 2.7183 (the base of the natural logarithm), with

$$e^{x} = 1 + \frac{x}{1!} + \frac{x^2}{2!} + \frac{x^3}{3!} + \cdots + \frac{x^r}{r!} + \cdots$$

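Proof: As n → ∞ and np = m,
p = m/n and q = 1 − m/n.
The probability function of the binomial distribution is

$$P(r) = {}^{n}C_{r}\, p^{r} q^{n-r} = \frac{n!}{r!\,(n-r)!}\, p^{r} q^{n-r}$$

$$= \frac{n(n-1)(n-2)\cdots(n-r+1)}{r!} \left(\frac{m}{n}\right)^{r} \left(1 - \frac{m}{n}\right)^{n-r}$$

$$= \frac{m^{r}}{r!} \left(1 - \frac{1}{n}\right)\left(1 - \frac{2}{n}\right)\cdots\left(1 - \frac{r-1}{n}\right) \left(1 - \frac{m}{n}\right)^{n} \left(1 - \frac{m}{n}\right)^{-r}$$


Taking the limit as n → ∞,

$$P(r) = \frac{m^{r}}{r!}\,(1 - 0)(1 - 0)\cdots(1 - 0) \times \lim_{n \to \infty}\left(1 - \frac{m}{n}\right)^{n} \times \lim_{n \to \infty}\left(1 - \frac{m}{n}\right)^{-r}$$

We know that

$$\lim_{n \to \infty}\left(1 - \frac{m}{n}\right)^{n} = e^{-m}$$

$$\lim_{n \to \infty}\left(1 - \frac{m}{n}\right)^{-r} = 1, \quad \text{as } r \text{ is not a function of } n$$

Thus,

$$P(r) = \frac{e^{-m}\, m^{r}}{r!}, \quad r = 0, 1, 2, \ldots, \infty$$

Putting r = 0, 1, 2, ... in the above equation, we obtain the
probabilities of 0, 1, 2, ... successes respectively as

$$e^{-m},\ \frac{m\, e^{-m}}{1!},\ \frac{m^{2} e^{-m}}{2!},\ \ldots$$


Total probability is 1:

$$\sum_{r=0}^{\infty} P(X = r) = \sum_{r=0}^{\infty} \frac{e^{-m}\, m^{r}}{r!} = e^{-m}\left(1 + m + \frac{m^2}{2!} + \frac{m^3}{3!} + \cdots\right) = e^{-m}\, e^{m} = 1$$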
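
The limiting argument above can be illustrated numerically: holding np = m fixed while n grows, the binomial probability of r successes approaches the Poisson probability. The following Python sketch uses hypothetical values m = 2 and r = 3 purely for illustration.

```python
from math import comb, exp, factorial

def binomial_pmf(r, n, p):
    """P(X = r) for a binomial variate B(n, p)."""
    return comb(n, r) * p**r * (1 - p) ** (n - r)

def poisson_pmf(r, m):
    """P(X = r) = e^(-m) * m^r / r! for a Poisson variate with parameter m."""
    return exp(-m) * m**r / factorial(r)

m, r = 2.0, 3               # hypothetical values: mean 2, three successes
sizes = (10, 100, 1000, 10000)

gaps = []
for n in sizes:
    b = binomial_pmf(r, n, m / n)   # p = m/n shrinks as n grows
    gap = abs(b - poisson_pmf(r, m))
    gaps.append(gap)
    print(n, round(b, 6), round(poisson_pmf(r, m), 6), round(gap, 6))
```

The gap between the two probabilities shrinks roughly in proportion to 1/n, which is why the Poisson distribution is a convenient approximation when n is large and p is small.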


If we know m, all the probabilities of the Poisson distribution can be 
obtained, therefore m is the only parameter of the Poisson distribution. 
The application of this distribution in solving problems is illustrated 
through following examples. 
Example

A manufacturer of screws knows that 5% of his product is
defective. He sells his product in cartons of 100 items and guarantees
that not more than 10 items will be defective. What is the probability
that a carton will fail to meet the guaranteed quality?
Solution:

In this example p = 0.05 and n = 100. Therefore, m = np = 100
(0.05) = 5.


Prob. [the carton will fail to meet the guaranteed quality] = 1 − Prob.
[the carton will meet the guaranteed quality] = 1 − Prob. [not more than
10 items will be defective] = 1 − P[r ≤ 10]

= 1 − [P(0) + P(1) + P(2) + P(3) + ... + P(10)]


In the case of the Poisson distribution

$$P(r) = \frac{e^{-5}\, 5^{r}}{r!}$$

Therefore, we have

$$1 - \left[ e^{-5} + \frac{e^{-5}\, 5}{1!} + \frac{e^{-5}\, 5^{2}}{2!} + \cdots + \frac{e^{-5}\, 5^{10}}{10!} \right]$$

$$= 1 - e^{-5}\left[1 + 5 + \frac{5^{2}}{2!} + \frac{5^{3}}{3!} + \cdots + \frac{5^{10}}{10!}\right] = 1 - 0.9863 = 0.0137$$
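
The sum of eleven Poisson terms in this example is tedious by hand but short in code. The sketch below assumes the Poisson approximation used in the solution (m = np = 5); the function name `poisson_pmf` is illustrative. Evaluated this way, the sum of P(0) to P(10) comes out at about 0.9863, so the failure probability is about 0.0137.

```python
from math import exp, factorial

def poisson_pmf(r, m):
    """P(X = r) = e^(-m) * m^r / r!"""
    return exp(-m) * m**r / factorial(r)

n, p = 100, 0.05
m = n * p                     # Poisson parameter, m = np = 5

# Carton meets the guarantee when at most 10 items are defective
p_meets = sum(poisson_pmf(r, m) for r in range(11))
p_fails = 1 - p_meets

print(f"P(r <= 10) = {p_meets:.4f}")
print(f"P(r > 10)  = {p_fails:.4f}")
```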


Examples of Poisson Distribution
• The hourly number of customers arriving at a bank
• The daily number of accidents on a particular stretch of highway
• The hourly number of accesses to a particular web server
• The daily number of emergency calls in Dallas
• The number of typos in a book
• The monthly number of employees who had an absence in a large
company
• Monthly demands for a particular product


Properties of Poisson Distribution 
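i) Mean of the Poisson distribution is m.

$$\mu_1' = E(r) = \sum_{r=0}^{\infty} r\, \frac{e^{-m}\, m^{r}}{r!} = m\, e^{-m} \sum_{r=1}^{\infty} \frac{m^{r-1}}{(r-1)!}$$

$$= m\, e^{-m} \left[1 + m + \frac{m^2}{2!} + \frac{m^3}{3!} + \cdots \right] = m\, e^{-m}\, e^{m} = m$$


ii) Variance of the Poisson distribution is m.

$$\text{Variance} = \sum_{r=0}^{\infty} r^2\, p(r) - \left(\sum_{r=0}^{\infty} r\, p(r)\right)^2 = \sum_{r=0}^{\infty} r^2\, p(r) - (m)^2$$

where

$$\mu_2' = \sum_{r=0}^{\infty} r^2\, \frac{e^{-m}\, m^{r}}{r!} = \sum_{r=0}^{\infty} [r + r(r-1)]\, \frac{e^{-m}\, m^{r}}{r!} = \sum_{r=0}^{\infty} r\, \frac{e^{-m}\, m^{r}}{r!} + \sum_{r=0}^{\infty} r(r-1)\, \frac{e^{-m}\, m^{r}}{r!}$$

$$= m + e^{-m}\, m^{2} \sum_{r=2}^{\infty} \frac{m^{r-2}}{(r-2)!} = m + e^{-m}\, m^{2} \left[1 + m + \frac{m^2}{2!} + \frac{m^3}{3!} + \cdots \right]$$

$$= m + e^{-m}\, m^{2}\, e^{m} = m + m^{2}$$

$$\text{Variance} = \mu_2' - (\mu_1')^2 = m + m^2 - (m)^2 = m$$

Hence, for the Poisson distribution with parameter m, the mean is equal
to the variance.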
iii) Third and fourth central moments μ3 and μ4:

$$\mu_3 = m, \qquad \mu_4 = 3m^2 + m$$


iv) Pearson's constants β1 and β2 as well as γ1 and γ2 are given by

$$\beta_1 = \frac{\mu_3^2}{\mu_2^3} = \frac{m^2}{m^3} = \frac{1}{m}, \qquad \gamma_1 = \sqrt{\beta_1} = \frac{1}{\sqrt{m}}$$

$$\beta_2 = \frac{\mu_4}{\mu_2^2} = \frac{3m^2 + m}{m^2} = 3 + \frac{1}{m}, \qquad \gamma_2 = \beta_2 - 3 = \frac{1}{m}$$


It may be noted that the mean, the variance and the third central moment
of the Poisson distribution are identical and are equal to the value of
the parameter itself, namely m. Hence the Poisson distribution is always
a positively skewed distribution, as m > 0, as well as leptokurtic. As
the value of m increases, γ1 decreases and thus the skewness is reduced
for increasing values of m. As m → ∞, γ1 and γ2 tend to zero. So we
conclude that the curve of the Poisson distribution tends to be a
symmetrical curve for large values of m.
v) The mode of the Poisson distribution is determined by the value m. If
m is an integer then the distribution is bi-modal, the two modal values
being X = m and X = m − 1. When m is not an integer then the distribution
has a unique modal value, being the integral part of m.
vi) Additive property: If X1 and X2 are two independent
Poisson variates with parameters m1 and m2, then their sum X1 + X2 is
also a Poisson variate with parameter m1 + m2.
Example 

The mean of the Poisson distribution is 2.25. Find the other 
constants of the distribution. 
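Solution:


We have m = 2.25

$$\sigma = \sqrt{m} = \sqrt{2.25} = 1.5$$

$$\mu_1 = 0$$

$$\mu_2 = m = 2.25$$

$$\mu_3 = m = 2.25$$

$$\mu_4 = m + 3m^2 = 2.25 + 3(2.25)^2 = 2.25 + 15.1875 = 17.4375$$

$$\beta_1 = \frac{1}{m} = \frac{1}{2.25} = 0.444$$

$$\beta_2 = 3 + \frac{1}{m} = 3 + 0.444 = 3.444$$

$$\gamma_1 = \sqrt{\beta_1} = \frac{1}{\sqrt{m}} = \frac{1}{1.5} = 0.67$$

$$\gamma_2 = \beta_2 - 3 = \frac{1}{m} = \frac{1}{2.25} = 0.444$$


This curve is positively skewed and leptokurtic.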
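
Since every constant of the Poisson distribution depends only on m, the whole example reduces to a small function. The following Python sketch is illustrative; the function name `poisson_constants` and the dictionary keys are assumptions, while the formulas are those of properties (i) to (iv).

```python
from math import sqrt

def poisson_constants(m):
    """Constants of a Poisson distribution with parameter m."""
    mu2, mu3, mu4 = m, m, m + 3 * m**2
    beta1 = mu3**2 / mu2**3      # = 1/m
    beta2 = mu4 / mu2**2         # = 3 + 1/m
    return {
        "sd": sqrt(m),
        "mu2": mu2, "mu3": mu3, "mu4": mu4,
        "beta1": beta1, "beta2": beta2,
        "gamma1": sqrt(beta1),   # skewness, always positive
        "gamma2": beta2 - 3,     # excess kurtosis, always positive
    }

c = poisson_constants(2.25)
for name, value in c.items():
    print(f"{name} = {value:.4f}")
```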
CHECK YOUR PROGRESS - 1 


1. List the types of random variable.
2. What are the properties of the Binomial distribution?
3. List a few examples of the Poisson distribution.


4.6 NORMAL DISTRIBUTION 


The normal distribution is probably the most important and widely used
theoretical distribution. Unlike the Binomial and Poisson distributions,
it is a continuous probability distribution. It has been observed that a
vast number of variables arising in studies of agricultural and dairying,
social, psychological and economic phenomena tend to follow the normal
distribution. The normal distribution was first discovered by the French
mathematician Abraham de Moivre in 1733, who obtained this continuous
distribution as a limiting case of the Binomial distribution. It was
later rediscovered and applied by Laplace and Gauss. It is also known as
the Gaussian distribution, after Karl Friedrich Gauss.

A continuous random variable X is said to have a normal distribution
with parameters μ (mean) and σ (standard deviation) if its density
function is given by the probability law

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\; e^{-\frac{(x-\mu)^2}{2\sigma^2}}, \quad -\infty < x < \infty$$

where π and e are constants given approximately by π = 22/7 and
e = 2.7183 (the base of natural
logarithms).
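
The normal density can be evaluated directly from its definition. The Python sketch below is illustrative: the function name `normal_pdf` is an assumption, the standard values μ = 0 and σ = 1 are chosen only as an example, and the area under the curve is approximated with a crude Riemann sum rather than an exact integral.

```python
from math import exp, pi, sqrt

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at the point x."""
    return exp(-((x - mu) ** 2) / (2 * sigma**2)) / (sigma * sqrt(2 * pi))

mu, sigma = 0.0, 1.0   # illustrative parameters

# Height of the curve at the mean: 1 / (sigma * sqrt(2 * pi))
peak = normal_pdf(mu, mu, sigma)

# Approximate total area by summing thin rectangles over mu - 8*sigma to mu + 8*sigma
step = 0.001
area = sum(normal_pdf(mu + k * step, mu, sigma) * step for k in range(-8000, 8000))

print(f"peak = {peak:.4f}")
print(f"area = {area:.6f}")
```

The computed area is very close to 1, as required of any probability density, and the density takes equal values at points equidistant from the mean.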


Properties of normal distribution: 


1) A random variable X with mean μ and variance σ² following the
normal law given above is represented as X ~ N(μ, σ²).
2) If X ~ N(μ, σ²) then Z = (X − μ)/σ is defined as a standard normal
variate with E(Z) = 0 and Var(Z) = 1, and we write Z ~ N(0, 1).
3) The p.d.f. of a standard normal variate Z is given by

$$\varphi(z) = \frac{1}{\sqrt{2\pi}}\; e^{-\frac{z^2}{2}}, \quad -\infty < z < \infty, \quad \text{where } Z = \frac{X - \mu}{\sigma}$$

4) Normal distribution is a limiting form of the binomial distribution
when
a) n, the number of trials, is indefinitely large, i.e. n → ∞, and
b) neither p nor q is very small.
5) Normal distribution is a limiting form of the Poisson distribution
when its mean m is large.


Characteristics of Normal Distribution 
It has the following properties: 
1. The graph of f(x) is a bell-shaped, unimodal and symmetric curve, as
shown in Fig. 12.1. The top of the bell is directly above the
mean (μ).

[Fig. 12.1: Normal probability curve, with Mean = Mode = Median (M = Mo = Md)]

2. The curve is symmetrical about the line X = μ (Z = 0), i.e. it has
the same shape on either side of the line X = μ (or Z = 0). This is
because the equation of the curve φ(z) remains unchanged if we
change z to −z.
3. Since the distribution is symmetrical, mean, median and mode
coincide. Thus, Mean = Median = Mode = μ.
4. Since Mean = Median = Mode = μ, the ordinate at X = μ (Z = 0)
divides the whole area into two equal parts. Further, since the total
area under the normal probability curve is 1, the area to the right of
the ordinate as well as to the left of the ordinate at X = μ (or Z =
0) is 0.5.
5. Also, by virtue of symmetry the quartiles are equidistant from the
median (μ), i.e.

$$Q_3 - M_d = M_d - Q_1$$

6. Since the distribution is symmetrical, all moments of odd order
about the mean are zero. Thus

$$\mu_{2n+1} = 0 \quad (n = 0, 1, 2, \ldots), \quad \text{i.e. } \mu_1 = \mu_3 = \mu_5 = \cdots = 0$$

7. The moments (about the mean) of even order are given by

$$\mu_{2n} = 1 \cdot 3 \cdot 5 \cdots (2n - 1)\, \sigma^{2n} \quad (n = 1, 2, 3, \ldots)$$

Putting n = 1 and 2 we get

$$\mu_2 = \sigma^2 \quad \text{and} \quad \mu_4 = 3\sigma^4$$

8. Since the distribution is symmetrical, the moment coefficient of
skewness is given by

$$\beta_1 = \frac{\mu_3^2}{\mu_2^3} = 0 \ \Rightarrow\ \gamma_1 = 0$$

9. The coefficient of kurtosis is given by

$$\beta_2 = \frac{\mu_4}{\mu_2^2} = \frac{3\sigma^4}{\sigma^4} = 3 \ \Rightarrow\ \gamma_2 = 0$$

10. No portion of the curve lies below the x-axis, since f(x), being a
probability density, can never be negative.
11. Theoretically, the range of the distribution is from −∞ to ∞.
But practically, range = 6σ.
12. As x increases numerically [i.e. on either side of X = μ], the value
of f(x) decreases rapidly, the maximum probability occurring at X
= μ and given by

$$[f(x)]_{\max} = \frac{1}{\sigma\sqrt{2\pi}}$$

Thus the maximum value of f(x) is inversely proportional to the
standard deviation. For large values of σ, f(x) decreases, i.e. the
curve tends to flatten out, and for small values of σ, f(x) increases,
i.e. the curve has a sharp peak.
13. The distribution is unimodal, with the only mode occurring at X = μ.
14. The X-axis is an asymptote to the curve, i.e. for numerically large
values of X (on either side of the line X = μ), the curve becomes
parallel to the X-axis and is supposed to meet it at infinity.
15. A linear combination of independent normal variates is also a
normal variate. If X1, X2, ..., Xn are independent normal variates
with means μ1, μ2, ..., μn and standard deviations σ1, σ2, ...,
σn respectively, then their linear combination

$$a_1 X_1 + a_2 X_2 + \cdots + a_n X_n$$

where a1, a2, ..., an are constants, is also a normal variate with

$$\text{Mean} = a_1 \mu_1 + a_2 \mu_2 + \cdots + a_n \mu_n \quad \text{and} \quad \text{Variance} = a_1^2 \sigma_1^2 + a_2^2 \sigma_2^2 + \cdots + a_n^2 \sigma_n^2$$

In particular, if we take a1 = a2 = ... = an = 1, then we get that
X1 + X2 + ... + Xn is a normal variate with mean
μ1 + μ2 + ... + μn and variance σ1² + σ2² + ... + σn². Thus, the
sum of independent normal variates is also a normal variate with
mean equal to the sum of their means and standard deviation equal to
the square root of the sum of the squares of their standard deviations.
This is known as the Re-productive or Additive Property of the
Normal distribution.
16. Mean Deviation (M.D.) about mean or median or mode is given
by

$$\text{M.D.} = \sqrt{\frac{2}{\pi}}\; \sigma \approx \frac{4}{5}\, \sigma$$

17. Quartiles are given (in terms of μ and σ) by

$$Q_1 = \mu - 0.6745\,\sigma \quad \text{and} \quad Q_3 = \mu + 0.6745\,\sigma$$

18. Quartile deviation (Q.D.) is given by

$$\text{Q.D.} = \frac{Q_3 - Q_1}{2} = 0.6745\,\sigma \approx \frac{2}{3}\, \sigma$$

19. We have (approximately):

$$\text{Q.D.} : \text{M.D.} : \text{S.D.} :: \tfrac{2}{3}\sigma : \tfrac{4}{5}\sigma : \sigma :: \tfrac{2}{3} : \tfrac{4}{5} : 1$$

⇒ Q.D. : M.D. : S.D. :: 10 : 12 : 15
From these ratios we also have 4 S.D. = 5 M.D. = 6 Q.D.
20. Points of inflexion of the normal curve are at X = μ ± σ, i.e. they
are equidistant from the mean at a distance of σ, and are given by:

$$x = \mu \pm \sigma, \qquad f(x) = \frac{1}{\sigma\sqrt{2\pi}}\; e^{-1/2}$$


21. Area property: One of the most fundamental properties of the
normal probability curve is the area property. If X ~ N(μ, σ²),
then the probability that a random value of X will lie between X = μ
and X = x1 is given by

$$P(\mu < X < x_1) = \int_{\mu}^{x_1} f(x)\,dx = \frac{1}{\sigma\sqrt{2\pi}} \int_{\mu}^{x_1} e^{-\frac{(x-\mu)^2}{2\sigma^2}}\,dx$$

Put (x − μ)/σ = z, so that x = μ + σz; at x = μ, z = 0, and at x = x1,
z = (x1 − μ)/σ = z1 (say).

$$\therefore\ P(\mu < X < x_1) = P(0 < Z < z_1) = \frac{1}{\sqrt{2\pi}} \int_{0}^{z_1} e^{-\frac{z^2}{2}}\,dz = \int_{0}^{z_1} \varphi(z)\,dz$$

where

$$\varphi(z) = \frac{1}{\sqrt{2\pi}}\; e^{-\frac{z^2}{2}}$$

is the probability function of the standard normal variate. The definite
integral

$$\int_{0}^{z_1} \varphi(z)\,dz$$

is known as the Normal Probability integral and gives the area under the
standard normal curve between the ordinates z = 0 and z = z1. These areas
have been provided in the form of a table for different values of z1 at
intervals of 0.01, which is available in any standard textbook of
statistics.

Particular Cases: 

1. In particular, the probability that a random variable X lies in the 
interval (u-0, ut+o) is given by 

P(u—o <X < pto)= JP"? f(x)dx. 


1 1 
P(—1<Z<1)= | Ø @)dz=2 | Ø (z)dz = 2(0.3413) = 0.6826 
—1 o 


The area under the normal probability curve between the ordinates at 
X= u-o and X= ito is 0.6826.In other words, the range X = u-o covers 
68.26% of the observations (as shown in Fig). This is known as 1o limit 
of normal distribution 

This is known as lo limit of normal distribution 
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<- 68.26% 


-3c p-2o p-—co X=p p+o pt+26 p+3e 
Z=3 Z=-2 Z=-1 Z=0 Z=+1 Z=+2 Z=+3 


1o, 26 and 3c under Normal Probability Curve 


2. The probability that random variable X lies in the interval (u-2o0, 
u+2o0) is given by 


P (w—20 < X < p+20) = | 
p-20 


pt2o 


f (x)dz.> P (-2< Z <2) = fo (z)dz 


F 
= 2 f Ø (z)dz = 2(0.47725) = 0.95445 
o 


The area under the normal probability curve between the ordinates at 
X= u-20 and X= u+20 is 0.95445. In other words, the range X = u+20 
covers 95.445% of the observations (as shown in Fig.). This is known as 
20 limits of normal distribution and is considered as warning limit in 
case of statistical quality control which implies that it is a warning to the 
manufacturer that the manufacturing process is going out of control. 

3. The probability that a random variable X lies in the interval $(\mu - 3\sigma, \mu + 3\sigma)$ is given by

$$P(\mu - 3\sigma < X < \mu + 3\sigma) = P(-3 < Z < 3)$$

$$= \int_{-3}^{3} \phi(z)\,dz = 2\int_{0}^{3} \phi(z)\,dz = 2(0.49865) = 0.9973$$

The area under the normal probability curve between the ordinates at $X = \mu - 3\sigma$ and $X = \mu + 3\sigma$ is 0.9973. In other words, the range $X = \mu \pm 3\sigma$ covers 99.73% of the observations (as shown in Fig.). This is known as the $3\sigma$ limit of the normal distribution. In statistical quality control, an observation outside this limit implies that the manufacturing process is out of control.

Thus, the probability that a normal variate X lies outside the range $\mu \pm 3\sigma$ is given as

$$P(|X - \mu| > 3\sigma) = P(|Z| > 3) = 1 - P(-3 < Z < 3) = 1 - 0.9973 = 0.0027$$

Thus, in all probability, we should expect a normal variate to lie within the range $\mu \pm 3\sigma$, though theoretically it may range from $-\infty$ to $\infty$.
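
The three limits above can be checked numerically instead of being read from a table. The following is a minimal sketch using Python's standard `statistics.NormalDist`; the helper name `area_within` is only illustrative.

```python
from statistics import NormalDist

# Standard normal variate Z ~ N(0, 1)
Z = NormalDist(mu=0.0, sigma=1.0)

def area_within(k: float) -> float:
    """Return P(-k < Z < k), i.e. P(mu - k*sigma < X < mu + k*sigma)."""
    return Z.cdf(k) - Z.cdf(-k)

for k in (1, 2, 3):
    inside = area_within(k)
    # The area outside the k-sigma limits is the complement.
    print(f"{k}-sigma limits: inside = {inside:.4f}, outside = {1 - inside:.4f}")
```

The printed areas agree with the tabulated values to about three decimal places; small differences in the fourth decimal come from rounding in four-figure tables.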
Example
An intelligence test was administered to 1000 students. The average score of the students was 42 with a standard deviation of 24. Find
(a) Number of students exceeding a score of 50
(b) Number of students scoring between 30 and 58
(c) Value of score exceeded by top 100 students.
Solution:
In this problem $\mu = 42$ and $\sigma = 24$. Let X denote the score obtained.
(a) Number of students exceeding a score of 50

[Figure: normal curve with $\mu = 42$; the area to the right of X = 50 is shaded.]

We want to find P(X > 50), i.e. the probability of the shaded portion.

At X = 50, $Z = \frac{50 - 42}{24} \approx 0.334$

P(X > 50) = P(Z > 0.334) = 0.5 - P(0 < Z < 0.334) = 0.5 - 0.1308 = 0.3692
Number of students = 1000 × 0.3692 = 369.2 ≈ 369 students


(b) Number of students scoring between 30 and 58

[Figure: normal curve with $\mu = 42$; the area between X = 30 and X = 58 is shaded.]

We want to find P(30 < X < 58), i.e. the probability of the shaded portion.

At $X_1 = 30$, $Z_1 = \frac{30 - 42}{24} = -0.5$

$P(-0.5 < Z < 0) = P(0 < Z < 0.5) = 0.1915$

At $X_2 = 58$, $Z_2 = \frac{58 - 42}{24} = 0.6667$

$P(0 < Z < 0.6667) = 0.2476$

P(30 < X < 58) = P(-0.5 < Z < 0.6667) = 0.1915 + 0.2476 = 0.4391
Number of students = 1000 × 0.4391 = 439.1 ≈ 439 students


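(c) Value of score exceeded by top 100 students.
Let $x_1$ be the value of the score exceeded by the top 100 students. The probability of a student being among the top 100 is 100/N = 100/1000 = 0.1, so that $P(X > x_1) = 0.1$.

[Figure: normal curve with $\mu = 42$; the area to the right of $X = x_1$ is shaded.]

At $X = x_1$, $Z_1 = \frac{x_1 - 42}{24}$.

$$P(X > x_1) = P(Z > Z_1) = 0.1 \;\Rightarrow\; P(0 < Z < Z_1) = 0.4 \;\Rightarrow\; Z_1 = \frac{x_1 - 42}{24} = 1.28 \text{ (from the table)}$$

$$x_1 = 42 + 24(1.28) = 72.72 \approx 73$$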


Conclusion 
(a) 369 students scored more than 50. 
(b) 439 students scored between 30 & 58. 
(c) Minimum score of top 100 students is 73. 
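
The three answers can be reproduced without a table. This is a minimal sketch using the standard-library `statistics.NormalDist`; the variable names are illustrative, and the results differ from the table-based working only in the last decimal place.

```python
from statistics import NormalDist

N_STUDENTS = 1000
scores = NormalDist(mu=42, sigma=24)  # X ~ N(42, 24^2), as given in the example

# (a) P(X > 50)
p_above_50 = 1 - scores.cdf(50)

# (b) P(30 < X < 58)
p_30_to_58 = scores.cdf(58) - scores.cdf(30)

# (c) score x1 with P(X > x1) = 100/1000, i.e. the 90th percentile
cutoff_top_100 = scores.inv_cdf(1 - 100 / N_STUDENTS)

print("Students above 50:      ", round(N_STUDENTS * p_above_50))
print("Students between 30-58: ", round(N_STUDENTS * p_30_to_58))
print("Score exceeded by top 100:", round(cutoff_top_100))
```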


4.7 SUMMARY 


• Conditions for the binomial probability distribution are
   i. The trials are independent.
   ii. The number of trials is finite.
   iii. Each trial has only two possible outcomes, called success and failure.
   iv. The probability of success in each trial is a constant.
• The parameters of the binomial distribution are n and p.
• The mean of the binomial distribution is np and its variance is npq.
• The Poisson distribution is the limiting form of the binomial distribution when n is large, p is small and np is finite.
• The Poisson probability distribution is $P(X = x) = \frac{e^{-\lambda}\lambda^{x}}{x!}$, x = 0, 1, 2, 3, ..., where $\lambda = np$.
• The mean and the variance of the Poisson distribution are both $\lambda$.
• $\lambda$ is the only parameter of the Poisson distribution.
• The Poisson distribution can never be symmetrical.
• It is a distribution for rare events.
• The normal distribution is the limiting form of the binomial distribution when n is large and neither p nor q is small.


4.8 KEY WORDS 


Random Variables, Binomial Distribution, Poisson Distribution, Normal
Distribution


4.9 ANSWERS TO CHECK YOUR PROGRESS


1. Discrete random variable, continuous random variable
2. Mean = np, SD = $\sqrt{npq}$, variance = npq
3. Examples
   o The hourly number of customers arriving at a bank
   o The daily number of accidents on a particular stretch of highway
   o The hourly number of accesses to a particular web server
   o The daily number of emergency calls to 108
   o The number of typos in a book
   o The monthly number of employees who had an absence in a large company


4.10 QUESTIONS AND EXERCISES


SHORT ANSWER QUESTIONS: 


1. Define Binomial distribution.
2. Mention the properties of Normal distribution.
3. What are the main characteristics of Poisson distribution?
4. Determine the binomial distribution for which the mean is 4 and
variance 3. Also find P(X = 15).
LONG ANSWER QUESTIONS
1. What is meant by probability distribution of a discrete random
variable?


2. Define Binomial distribution. What are the main characteristics of
binomial distribution?

3. Write the main characteristics of normal distribution.

4. Fit the Poisson distribution to the following:

x | 0   | 1  | 2  | 3  | 4 | 5
f | 120 | 82 | 52 | 22 | 4 | 0
UNIT - V ESTIMATION


Structure 

5.1. Introduction 

5.2 Reasons For Making Estimates 
5.3 Types Of Estimates 

5.4 Point Estimation 

5.5 Interval Estimation 

5.6 Criteria of a Good Estimator 
5.7 Unbiasedness 

5.8 Consistency 

5.9 Efficiency 

5.10 Sufficiency 

5.11 Confidence Intervals 

5.12 Determining The Sample Size In Estimation


5.1. INTRODUCTION 


The sampling process is used to draw statistical inference about the
characteristics of a population or process of interest. On many
occasions we do not have enough information to calculate an
exact value of population parameters (such as $\mu$, $\sigma$ and $p$)
and therefore make the best estimate of this value from the
corresponding sample statistics (such as $\bar{x}$, $s$ and $\bar{p}$). The need to
use the sample statistic to draw conclusions about the population
characteristic is one of the fundamental applications of statistical
inference in business and economics.


5.2 Reasons for Making Estimates 

A few applications of statistical estimation are given below:

A production manager needs to determine the proportion of items being
manufactured that do not meet quality standards.

A mobile phone service company may be interested to know the average
length of a long distance telephone call and its standard deviation.

A bank needs to understand consumer awareness of its services and credit
schemes.

Any service centre needs to determine the average amount of time a
customer spends in queue.

In all such cases, a decision-maker needs to examine the following two
concepts that are useful for drawing statistical inference about unknown
population or process parameters based upon random samples:

Estimation: using a sample statistic to estimate an unknown parameter value.

Hypothesis testing: testing a claim or belief about an unknown parameter value.

In this lesson we shall discuss methods to estimate an unknown population
parameter and then to determine the range of values (confidence interval)
likely to contain the parameter value.


5.3 TYPES OF ESTIMATES 
Let us first know the concept of ‘estimate’ as used in Statistics.
According to some dictionaries, an estimate is a valuation based on
opinion or roughly made from imperfect or incomplete data. This
definition may apply, for example, when an individual has an
opinion about the competence of one of his colleagues. But in Statistics
the term estimate is not used in this sense. In Statistics too, estimates
are made when the information available is incomplete or imperfect.
However, such estimates are made only when they are based on sound
judgement or experience and when the samples are scientifically selected.


There are two types of estimates that we can make about a population : a 
point estimate and an interval estimate. A point estimate is a single 
number, which is used to estimate an unknown population parameter. 
Although a point estimate may be the most common way of expressing an 
estimate, it suffers from a major limitation since it fails to indicate how 
close it is to the quantity it is supposed to estimate. In other words, a point 
estimate does not give any idea about the reliability or precision of the
method of estimation used. For instance, if someone claims that 40 
percent of all children in a certain town do not go to the school and are 
devoid of education, it would not be very helpful if this claim is based on 
a small number of households, say, 20. However, as the number of 
households interviewed for this purpose increases from 20 to 100, 500 or 
even 5,000, the claim that 40 percent of children have no school 
education would become more and more meaningful and reliable. This 
makes it clear that a point estimate should always be accompanied by 
some relevant information so that it is possible to judge how far it is 
reliable. 


The second type of estimate is known as the interval estimate. It is a range 
of values used to estimate an unknown population parameter. In case of 
an interval estimate, the error is indicated in two ways: first by the extent 
of its range; and second, by the probability of the true population 
parameter lying within that range. Taking our previous example of 40 
percent children not having a school education, the statistician may say 
that actual percentage of such children in that town may lie between 35 
percent and 45 percent. Thus, he will have a better idea of the reliability 
of such an estimate as compared to the point estimate of 40 percent. 
Estimator and Estimate
When we make an estimate of a population parameter, we use a sample
statistic. This sample statistic is an estimator. For example, the sample mean

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$

is a point estimator of the population mean $\mu$. The value obtained by the
estimator is known as an estimate. Many different statistics can be used to
estimate the same parameter. For example, we may use the sample mean or
the sample median or even the range to estimate the population mean. The
question here is: how can we evaluate the properties of these estimators,
compare them with one another, and finally decide which is the ‘best’? The
answer to this question is possible only when we have certain criteria that a
good estimator must satisfy. These criteria are briefly discussed below.


5.4 POINT ESTIMATION 


In point estimation, a single sample statistic (such as $\bar{x}$, $s$ and $\bar{p}$) is
calculated from the sample to provide a best estimate of the true value of
the corresponding population parameter (such as $\mu$, $\sigma$ and $p$). Such a
single relevant statistic is termed a point estimator, and the
value of the statistic is termed a point estimate. For example, we may
calculate that 10 per cent of the items in a random sample taken from a
day’s production are defective. The result ‘10 per cent’ is a point estimate
of the percentage of items in the whole lot that are defective. Thus, until
the next sample of items is drawn and examined, we may proceed with
manufacturing on the assumption that any day’s production contains 10
per cent defective items.


5.5 INTERVAL ESTIMATION 


Generally, a point estimate does not provide information about ‘how close
the estimate is’ to the population parameter unless it is accompanied by a
statement of the possible sampling errors involved, based on the sampling
distribution of the statistic. It is therefore important to know the precision
of an estimate before relying on it to make a decision. Thus, decision-makers
prefer to use an interval estimate that is likely to contain the
population parameter value. However, it is also important to state ‘how
confident’ the decision-maker is that the interval estimate actually contains
the parameter value. Hence an interval estimate of a population parameter is a
confidence interval with a statement of confidence that the interval
contains the parameter value.


5.6 CRITERIA OF A GOOD ESTIMATOR 


There are four criteria by which we can evaluate the quality of a statistic as 
an estimator. 
These are: unbiasedness, efficiency, consistency and sufficiency. 


5.7 Unbiasedness 


This is a very important property that an estimator should possess. If we
take all possible samples of the same size from a population and calculate
their means, the mean $\mu_{\bar{x}}$ of all these means will be equal to the mean $\mu$ of
the population. This means that the sample mean is an unbiased estimator
of the population mean $\mu$. When the expected value (or mean) of a sample
statistic is equal to the value of the corresponding population parameter,
the sample statistic is said to be an unbiased estimator.

Suppose we take the smallest sample observation as an estimator of the
population mean $\mu$. It can be easily shown that this estimator is biased.
Since the smallest observation can never exceed the sample mean, its expected
value must be less than $\mu$. Symbolically, $E(X_s) < \mu$, where $X_s$ stands for
the smallest item and E stands for the expected value. Thus, this estimator
is biased downwards. The extent of bias is the difference between the
expected value of the estimator and the value of the parameter. In this case,
the bias is equal to $E(X_s) - \mu$. In contrast, the bias of the sample mean $\bar{x}$ is
zero.
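
The idea of taking all possible samples of the same size can be made concrete on a tiny population. The sketch below is illustrative only: the five population values and the sample size of 2 are hypothetical, chosen so that every sample (drawn without replacement) can be enumerated.

```python
from itertools import combinations

# Hypothetical population, small enough to enumerate every possible sample
population = [2, 4, 6, 8, 10]
n = 2  # sample size

mu = sum(population) / len(population)      # population mean
samples = list(combinations(population, n))  # all possible samples of size n

# Expected value of each estimator = average over all equally likely samples
mean_of_sample_means = sum(sum(s) / n for s in samples) / len(samples)
mean_of_sample_minima = sum(min(s) for s in samples) / len(samples)

bias_of_mean = mean_of_sample_means - mu        # zero: unbiased
bias_of_minimum = mean_of_sample_minima - mu    # negative: biased downwards

print("population mean      :", mu)
print("bias of sample mean  :", bias_of_mean)
print("bias of smallest item:", bias_of_minimum)
```

For this population the sample mean has zero bias, while the smallest observation underestimates the population mean on average.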


5.8 Consistency 

Another important characteristic that an estimator should possess is
consistency. Let us take the case of the standard deviation of the
sampling distribution of $\bar{x}$. The standard deviation of the sampling
distribution of the sample mean is computed by the following formula:

$$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$$

The formula states that the standard deviation of the sampling
distribution of $\bar{x}$ decreases as the sample size increases and vice versa.
When the sample size n increases, the population standard deviation $\sigma$ is
divided by a larger denominator. This results in a reduced value
of the standard deviation $\sigma_{\bar{x}}$ of the sample mean. Let us take an example.


Illustration 13.1: A company has 4,000 employees whose
average monthly wage comes to Rs.4,800 with a standard
deviation of Rs.1,200. Let $\bar{x}$ be the mean monthly wage
for a random sample of certain employees selected from
this company. Find the mean and standard deviation of $\bar{x}$
for a sample size of (a) 81, (b) 100 and (c) 180.


Solution


From the given information, for the population of all employees, N = 4,000,
$\mu$ = Rs.4,800 and $\sigma$ = Rs.1,200.


(a) The mean $\mu_{\bar{x}}$ of the sampling distribution of $\bar{x}$ is $\mu_{\bar{x}} = \mu$ = Rs.4,800.

Here n = 81 and N = 4,000, which gives n/N = 81/4,000 = 0.02. As this value is less
than 0.05, the standard deviation of $\bar{x}$ is obtained by using the formula $\sigma_{\bar{x}} = \sigma/\sqrt{n}$.
Substituting the values,

$$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} = \frac{1,200}{\sqrt{81}} = \frac{1,200}{9} = \text{Rs.}133.33$$

(b) In this case, n = 100 and n/N = 100/4,000 = 0.025, which is also less
than 0.05. The mean and the standard deviation of $\bar{x}$ are

$$\mu_{\bar{x}} = \mu = \text{Rs.}4,800$$

$$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} = \frac{1,200}{\sqrt{100}} = \frac{1,200}{10} = \text{Rs.}120$$

(c) In this case, n = 180 and n/N = 180/4,000 = 0.045, which again is less
than 0.05. The mean and the standard deviation of $\bar{x}$ are

$$\mu_{\bar{x}} = \mu = \text{Rs.}4,800$$

$$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} = \frac{1,200}{\sqrt{180}} = \frac{1,200}{13.416} = \text{Rs.}89.44$$


From the above three sets of calculations, it becomes clear that the mean
of the sampling distribution of $\bar{x}$ is always equal to the mean of the
population regardless of the sample size. The standard deviation, however,
changes. In the given example, the
standard deviation of $\bar{x}$ decreased from Rs.133.33 to Rs.120 and then to
about Rs.89 as the sample size increased from 81 to 100 and then to 180.
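
The three calculations in the illustration follow one pattern, which can be sketched as below. The function name `standard_error` is illustrative; the 0.05 threshold on n/N is the rule used in the solution for applying $\sigma/\sqrt{n}$ without a finite population correction.

```python
from math import sqrt

N = 4000        # population size (number of employees)
SIGMA = 1200.0  # population standard deviation in Rs.

def standard_error(sigma: float, n: int) -> float:
    """Standard deviation of the sample mean: sigma / sqrt(n)."""
    return sigma / sqrt(n)

for n in (81, 100, 180):
    fraction = n / N
    # The simple formula is used only when the sampling fraction is below 0.05
    assert fraction < 0.05
    print(f"n = {n:3d}, n/N = {fraction:.3f}, SE = Rs.{standard_error(SIGMA, n):.2f}")
```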


5.9 Efficiency 


Another desirable property of a good estimator is that it should be
efficient. Efficiency is measured in terms of the size of the standard error of
the statistic. Since an estimator is a random variable, it is necessarily
characterised by a certain amount of variability. This means that some
estimators may be more variable than others. Just as bias is related to the
expected value of the estimator, so efficiency can be defined in terms of
the variance. In large samples, for example, the variance of the sample
mean is $V(\bar{x}) = \sigma^2/n$. As the sample size n increases, the variance of the
sample mean $V(\bar{x})$ becomes smaller, so the estimator becomes more
efficient. This criterion, when applied to large samples, gives better
estimates as compared to the small ones.

The efficiency of one estimator in relation to another estimator can be 
judged by comparing their sampling variances. Thus, efficiency relates to 
the size of the standard error. Given the same sample size, the statistic that 
has a smaller standard error is preferable as it is efficient in relation to 
another statistic that has a larger standard error. The sampling distribution 
of the mean and the median have the same mean, that is, the population 
mean. However, the variance of the sampling distribution of the means is 
smaller than the variance of the sampling distribution of the medians. As 
such, the sample mean is an efficient estimator of the population mean, 
while the sample median is an inefficient estimator. 


5.10 Sufficiency 


The fourth property of a good estimator is that it should be sufficient. A
sufficient statistic utilises all the information a sample contains about the
parameter to be estimated. The sample mean $\bar{x}$, for example, is a sufficient estimator of the
population mean $\mu$. It implies that no other estimator of $\mu$, such as the
sample median, can provide any additional information about the
parameter $\mu$. Likewise, we can say that the sample proportion is a sufficient
estimator of the population proportion $\pi$.
Having looked into the properties of a good estimator briefly, a pertinent
question arises: how can we find estimators with these desirable
properties? This brings us to the method of maximum likelihood.


5.6.2 METHOD OF MAXIMUM LIKELIHOOD (ML) 
The maximum likelihood method provides estimators with desirable
properties such as efficiency, consistency and sufficiency, which we
have just discussed. It does not, however, always give an unbiased estimate. Let
us take an example to explain this method.


Example: Suppose we want to estimate the average grade $\mu$ of a
large number of students. A random sample of size n = 64 is taken
and the sample mean $\bar{x}$ is found to be 90 marks. Now, the
assumption on which we have to base our reasoning is that the
random sample of n = 64 is representative of the population. We
saw how samples that were similar to the population had greater
probability of being selected.

Let us now reverse this reasoning as follows: we have before us a random
sample of size n = 64 with $\bar{x}$ = 90 marks. From which population did it most
probably come: a population with $\mu$ = 85, 90 or 95? According to our
earlier approach, we would think that it most probably came from a
population with $\mu$ = 90 marks. Thus, it can be concluded that the
population mean $\mu$, based on our sample, is most likely to be $\mu$ = 90 marks.


A point worth noting is that the population mean $\mu$ is either 90 or not; it
has only one value.
Hence, we have used the term likely instead of probably.


This technique to find estimators was first used and developed by Sir
R.A. Fisher in 1922, who called it the maximum likelihood method.
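
The comparison of the three candidate populations can be sketched numerically. The text does not give the population standard deviation, so the value `SIGMA = 8.0` below is purely an assumption for illustration, as is the choice of a normal model for the sample mean (reasonable for n = 64); any positive value leads to the same ranking of the candidates.

```python
from math import sqrt
from statistics import NormalDist

N_SAMPLE = 64
X_BAR = 90.0
SIGMA = 8.0  # hypothetical population standard deviation, not given in the text

def likelihood(mu: float) -> float:
    """Density of observing x_bar = 90 if the population mean were mu.

    The sample mean is modelled as N(mu, SIGMA^2 / n).
    """
    return NormalDist(mu=mu, sigma=SIGMA / sqrt(N_SAMPLE)).pdf(X_BAR)

candidates = (85.0, 90.0, 95.0)
for mu in candidates:
    print(f"mu = {mu:.0f}: likelihood = {likelihood(mu):.6f}")

# The candidate under which the observed sample mean is most plausible
most_likely_mu = max(candidates, key=likelihood)
print("most likely population mean:", most_likely_mu)
```

The likelihood is largest at the candidate equal to the observed sample mean, which is the reasoning the example describes in words.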


5.11 Confidence Intervals 


There are two types of estimates for each population parameter: the point 
estimate and confidence interval (CI) estimate. For both continuous 
variables (e.g., population mean) and dichotomous variables (e.g., 
population proportion) one first computes the point estimate from a 
sample. Recall that sample means and sample proportions are unbiased 
estimates of the corresponding population parameters. 


For both continuous and dichotomous variables, the confidence interval 
estimate (CI) is a range of likely values for the population parameter based 
on: 


the point estimate, e.g., the sample mean 

the investigator's desired level of confidence (most commonly 95%, but 
any level between 0-100% can be selected) 

and the sampling variability or the standard error of the point estimate. 
Strictly speaking, a 95% confidence interval means that if we were to take
100 different samples and compute a 95% confidence interval for each
sample, then approximately 95 of the 100 confidence intervals will contain
the true mean value ($\mu$). In practice, however, we select one random
sample and generate one confidence interval, which may or may not
contain the true mean. The observed interval may over- or underestimate $\mu$.
Consequently, the 95% CI is the likely range of the true, unknown
parameter. The confidence interval does not reflect the variability in the
unknown parameter. Rather, it reflects the amount of random error in the
sample and provides a range of values that are likely to include the
unknown parameter. Another way of thinking about a confidence interval
is that it is the range of likely values of the parameter (defined as the point
estimate ± margin of error) with a specified level of confidence (which is
similar to a probability).


Suppose we want to generate a 95% confidence interval estimate for an
unknown population mean. This means that there is a 95% probability that
the confidence interval will contain the true population mean. Thus,
P([sample mean] - margin of error < $\mu$ < [sample mean] + margin of error) =
0.95.


The Central Limit Theorem introduced in the module on Probability stated
that, for large samples, the distribution of the sample means is
approximately normal with a mean $\mu_{\bar{x}} = \mu$ and a standard deviation
(also called the standard error) $\sigma_{\bar{x}} = \sigma/\sqrt{n}$.


For the standard normal distribution, P(-1.96 < Z < 1.96) = 0.95, i.e., there
is a 95% probability that a standard normal variable, Z, will fall between
-1.96 and 1.96. The Central Limit Theorem states that for large samples:

$$Z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}}$$

By substituting the expression on the right side of the equation:

$$P\left(-1.96 < \frac{\bar{x} - \mu}{\sigma/\sqrt{n}} < 1.96\right) = 0.95$$

Using algebra, we can rework this inequality such that the mean ($\mu$) is the
middle term, as shown below:

$$P\left(-1.96\,\frac{\sigma}{\sqrt{n}} < \bar{x} - \mu < 1.96\,\frac{\sigma}{\sqrt{n}}\right) = 0.95$$

then

$$P\left(-\bar{x} - 1.96\,\frac{\sigma}{\sqrt{n}} < -\mu < -\bar{x} + 1.96\,\frac{\sigma}{\sqrt{n}}\right) = 0.95$$

and finally

$$P\left(\bar{x} - 1.96\,\frac{\sigma}{\sqrt{n}} < \mu < \bar{x} + 1.96\,\frac{\sigma}{\sqrt{n}}\right) = 0.95$$

This last expression, then, provides the 95% confidence interval for the
population mean, and this can also be expressed as:

$$\bar{x} \pm 1.96\,\frac{\sigma}{\sqrt{n}}$$

Thus, the margin of error is 1.96 times the standard error (the standard
deviation of the point estimate from the sample), and 1.96 reflects the fact
that a 95% confidence level was selected. So, the general form of a
confidence interval is:


point estimate ± Z × SE(point estimate)


where Z is the value from the standard normal distribution for the selected
confidence level (e.g., for a 95% confidence level, Z = 1.96).
In practice, we often do not know the value of the population standard
deviation ($\sigma$). However, if the sample size is large (n > 30), then the sample
standard deviation s can be used to estimate the population standard
deviation.
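
The general form "point estimate ± Z × SE" can be sketched as a small function. The name `confidence_interval` and the inputs in the usage lines are illustrative assumptions: they reuse the wage figures from Illustration 13.1 and suppose, hypothetically, that a sample of 100 employees gave a mean of Rs.4,800.

```python
from math import sqrt
from statistics import NormalDist

def confidence_interval(point_estimate: float, std_dev: float, n: int,
                        confidence: float = 0.95) -> tuple[float, float]:
    """Large-sample confidence interval for a mean: estimate +/- z * sd / sqrt(n).

    std_dev is the population standard deviation, or the sample standard
    deviation s when the sample is large (n > 30).
    """
    # Two-sided critical value, e.g. about 1.96 for 95% confidence
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin_of_error = z * std_dev / sqrt(n)
    return point_estimate - margin_of_error, point_estimate + margin_of_error

# Hypothetical sample: mean Rs.4,800, standard deviation Rs.1,200, n = 100
low, high = confidence_interval(4800.0, 1200.0, 100)
print(f"95% CI: Rs.{low:.2f} to Rs.{high:.2f}")
```

Raising the confidence level widens the interval, because the critical value Z grows while the standard error stays the same.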


5.12 Determining the Sample Size in Estimation


Sample size determination is the act of choosing the number of 
observations or replicates to include in a statistical sample. The sample size 
is an important feature of any empirical study in which the goal is to make 
inferences about a population from a sample. In practice, the sample size 
used in a study is usually determined based on the cost, time, or 
convenience of collecting the data, and the need for it to offer sufficient 
statistical power. In complicated studies there may be several different 
sample sizes: for example, in a stratified survey there would be different 
sizes for each stratum. In a census, data is sought for an entire population, 
hence the intended sample size is equal to the population. In experimental 
design, where a study may be divided into different treatment groups, there 
may be different sample sizes for each group. 


Sample sizes may be chosen in several ways: 


using experience — small samples, though sometimes unavoidable, can 
result in wide confidence intervals and risk of errors in statistical 
hypothesis testing. 

using a target variance for an estimate to be derived from the sample 
eventually obtained, i.e. if a high precision is required (narrow confidence 
interval) this translates to a low target variance of the estimator. 

using a target for the power of a statistical test to be applied once the 
sample is collected. 

using a confidence level, i.e. the larger the required confidence level, the 
larger the sample size (given a constant precision requirement). 
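
The last point, that a higher confidence level needs a larger sample for the same precision, can be illustrated by inverting the margin of error of a large-sample confidence interval for a mean, $E = z\,\sigma/\sqrt{n}$, to get $n = (z\sigma/E)^2$. This formula is not stated in the text above; it is a standard consequence of the confidence-interval form, and the numbers used (standard deviation Rs.1,200, margin Rs.100) are hypothetical.

```python
from math import ceil
from statistics import NormalDist

def sample_size_for_mean(sigma: float, margin: float, confidence: float = 0.95) -> int:
    """Smallest n whose margin of error z * sigma / sqrt(n) does not exceed `margin`."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return ceil((z * sigma / margin) ** 2)

# Hypothetical requirement: estimate a mean wage to within Rs.100
n_95 = sample_size_for_mean(sigma=1200.0, margin=100.0, confidence=0.95)
n_99 = sample_size_for_mean(sigma=1200.0, margin=100.0, confidence=0.99)
print("n for 95% confidence:", n_95)
print("n for 99% confidence:", n_99)
```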
6.9 Questions and Exercise

6.10 Further Readings 


6.0 INTRODUCTION 


Hypothesis testing was introduced by Ronald Fisher, Jerzy
Neyman, Karl Pearson and Pearson’s son, Egon Pearson. Hypothesis
testing is a statistical method that is used in making statistical decisions
using experimental data. A hypothesis is basically an assumption
that we make about a population parameter.


6.1 OBJECTIVES 


• Understand the purpose of hypothesis testing
• Understand the procedure for tests of hypotheses based on large
samples
• Solve the problems of testing hypotheses concerning mean(s) and
proportion(s) based on large samples


6.2 HYPOTHESIS TESTING ON POPULATION MEAN 
Two situations may arise: the first is when the population
variance is known, and the second is when the population variance is
unknown.


6.2.1 POPULATION VARIANCE KNOWN 
Steps:


1. Let $\mu$ and $\sigma^2$ be respectively the mean and the variance of the
population under study, where $\sigma^2$ is known. If $\mu_0$ is an admissible
value of $\mu$, then frame the null hypothesis as $H_0: \mu = \mu_0$ and
choose the suitable alternative hypothesis from
(i) $H_1: \mu \neq \mu_0$ (ii) $H_1: \mu > \mu_0$ (iii) $H_1: \mu < \mu_0$

2. Let $(X_1, X_2, ..., X_n)$ be a random sample of n observations drawn
from the population, where n is large ($n \geq 30$).

3. Let the level of significance be $\alpha$.

4. Consider the test statistic $Z = \frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}}$ under $H_0$. Here $\bar{X}$ represents
the sample mean. The approximate sampling distribution of the
test statistic under $H_0$ is the N(0,1) distribution.

5. Calculate the value of Z for the given sample $(x_1, x_2, ..., x_n)$ as
$z_0 = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}}$.

6. Find the critical value, $z_e$, corresponding to $\alpha$ and $H_1$ from the
following table

Alternative Hypothesis ($H_1$) | $\mu \neq \mu_0$ | $\mu > \mu_0$ | $\mu < \mu_0$
Critical Value ($z_e$) | $z_{\alpha/2}$ | $z_{\alpha}$ | $-z_{\alpha}$

7. Decide on $H_0$ choosing the suitable rejection rule from the
following table corresponding to $H_1$.

Alternative Hypothesis ($H_1$) | $\mu \neq \mu_0$ | $\mu > \mu_0$ | $\mu < \mu_0$
Rejection Rule | $|z_0| \geq z_{\alpha/2}$ | $z_0 > z_{\alpha}$ | $z_0 < -z_{\alpha}$
Example:


A company producing batteries finds that the mean life span of the
population of its batteries is 200 hours with a standard deviation of 15
hours. A sample of 100 batteries randomly chosen is found to have a
mean life span of 195 hours. Test, at 5% level of significance, whether
the mean life span of the batteries is significantly different from 200
hours.


Solution:


Step 1 : Let $\mu$ and $\sigma$ represent respectively the mean and standard
deviation of the probability distribution of the life span
of the batteries. It is given that $\sigma$ = 15 hours. The null
and alternative hypotheses are

Null hypothesis: $H_0: \mu = 200$

i.e., the mean life span of the batteries is not
significantly different from 200 hours.

Alternative hypothesis: $H_1: \mu \neq 200$

i.e., the mean life span of the batteries is significantly
different from 200 hours.

It is a two-sided alternative hypothesis.

Step 2 : Data

The given sample information is

Sample size (n) = 100, Sample mean ($\bar{x}$) = 195 hours

Step 3 : Level of significance

$\alpha$ = 5%

Step 4 : Test statistic

The test statistic is $Z = \frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}}$, under $H_0$.

Under the null hypothesis $H_0$, Z follows the N(0,1)
distribution.

Step 5 : Calculation of Test Statistic

The value of Z under $H_0$ is calculated from

$$z_0 = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}} = \frac{195 - 200}{15/\sqrt{100}} = -3.33$$

Thus, $|z_0| = 3.33$

Step 6 : Critical value

Since $H_1$ is a two-sided alternative, the critical value at
$\alpha$ = 0.05 is $z_e = z_{0.025} = 1.96$.

Step 7 : Decision

Since $H_1$ is a two-sided alternative, elements of the
critical region are determined by the rejection rule $|z_0| \geq z_e$.
Thus, it is a two-tailed test. For the given sample
information, the rejection rule holds, i.e., $|z_0| = 3.33 > z_e = 1.96$.
Hence, $H_0$ is rejected in favour of $H_1: \mu \neq 200$.
Thus, the mean life span of the batteries is
significantly different from 200 hours.
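
Steps 4 to 7 of this example can be sketched in a few lines. The function name `z_statistic` is illustrative; the critical value is taken from the standard normal distribution rather than from a printed table.

```python
from math import sqrt
from statistics import NormalDist

def z_statistic(sample_mean: float, mu0: float, sigma: float, n: int) -> float:
    """Large-sample test statistic (x_bar - mu0) / (sigma / sqrt(n))."""
    return (sample_mean - mu0) / (sigma / sqrt(n))

ALPHA = 0.05

# Battery example: H0: mu = 200 against the two-sided H1: mu != 200
z0 = z_statistic(sample_mean=195, mu0=200, sigma=15, n=100)

# Two-sided critical value z_{alpha/2}
z_critical = NormalDist().inv_cdf(1 - ALPHA / 2)

reject_h0 = abs(z0) >= z_critical

print(f"z0 = {z0:.2f}, critical value = {z_critical:.2f}, reject H0: {reject_h0}")
```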


6.2.2 POPULATION VARIANCE UNKNOWN 


Steps:


1. Let $\mu$ and $\sigma^2$ be respectively the mean and the variance of the
population under study, where $\sigma^2$ is unknown. If $\mu_0$ is an
admissible value of $\mu$, then frame the null hypothesis as $H_0: \mu = \mu_0$
and choose the suitable alternative hypothesis from
(i) $H_1: \mu \neq \mu_0$ (ii) $H_1: \mu > \mu_0$ (iii) $H_1: \mu < \mu_0$

2. Let $(X_1, X_2, ..., X_n)$ be a random sample of n observations
drawn from the population, where n is large ($n \geq 30$).

3. Specify the level of significance, $\alpha$.

4. Consider the test statistic $Z = \frac{\bar{X} - \mu_0}{S/\sqrt{n}}$ under $H_0$, where $\bar{X}$ and S are
the sample mean and sample standard deviation respectively. It
may be noted that the above test statistic is obtained from the
statistic for the known-variance case by substituting S for $\sigma$.

The approximate sampling distribution of the test statistic under
$H_0$ is the N(0,1) distribution.

5. Calculate the value of Z for the given sample $(x_1, x_2, ..., x_n)$ as
$z_0 = \frac{\bar{x} - \mu_0}{s/\sqrt{n}}$. Here, $\bar{x}$ and s are respectively the values of $\bar{X}$ and S
calculated for the given sample.

6. Find the critical value, $z_e$, corresponding to $\alpha$ and $H_1$ from the
following table


Alternative Hypothesis ($H_1$) | $\mu \neq \mu_0$ | $\mu > \mu_0$ | $\mu < \mu_0$
Critical Value ($z_e$) | $z_{\alpha/2}$ | $z_{\alpha}$ | $-z_{\alpha}$


7. Decide on $H_0$ choosing the suitable rejection rule from the
following table corresponding to $H_1$.

Alternative Hypothesis ($H_1$) | $\mu \neq \mu_0$ | $\mu > \mu_0$ | $\mu < \mu_0$
Rejection Rule | $|z_0| \geq z_{\alpha/2}$ | $z_0 > z_{\alpha}$ | $z_0 < -z_{\alpha}$


Example: 


A car manufacturing company desires to introduce a new model
car. The company claims that the mean fuel consumption of its new
model car is lower than that of the existing model of the car, which runs 57
km/litre (a lower fuel consumption corresponds to more kilometres per litre).
A sample of 100 cars of the new model is selected
randomly and their fuel consumptions are observed. It is found that the
mean for the 100 new model cars is 60 km/litre with a
standard deviation of 3 km/litre. Test the claim of the company at 5%
level of significance.


Solution:


Step 1 : Let the fuel consumption of the new model car be assumed to
be distributed according to a distribution with mean $\mu$ and
standard deviation $\sigma$. The null and
alternative hypotheses are

Null hypothesis $H_0: \mu = 57$

i.e., the average fuel consumption of the company’s new
model car is not significantly different from that of the
existing model.

Alternative hypothesis $H_1: \mu > 57$

i.e., the average fuel consumption of the company’s new
model car is significantly lower than that of the existing
model. In other words, the number of km per litre run by the new model
car is significantly more than that of the existing model car.

Step 2 : Data
The given sample information is
Size of the sample (n) = 100. Hence, it is a large sample.
Sample mean ($\bar{x}$) = 60
Sample standard deviation (s) = 3

Step 3 : Level of significance
$\alpha$ = 5%

Step 4 : Test statistic
The test statistic under $H_0$ is

$$Z = \frac{\bar{X} - \mu_0}{S/\sqrt{n}}$$

Since n is large, the sampling distribution of Z under $H_0$ is
the N(0,1) distribution.

Step 5 : Calculation of Test Statistic

The value of Z for the given sample information is
calculated from

$$z_0 = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} = \frac{60 - 57}{3/\sqrt{100}} = 10$$

Step 6 : Critical Value

Since $H_1$ is a one-sided (right) alternative hypothesis, the
critical value at $\alpha$ = 0.05 is $z_e = z_{0.05} = 1.645$.

Step 7 : Decision

Since $H_1$ is a one-sided (right) alternative, elements of the
critical region are defined by the rejection rule $z_0 > z_e = z_{0.05}$.
Thus, it is a right-tailed test. Since, for the given sample
information, $z_0 = 10 > z_e = 1.645$, $H_0$ is rejected.
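
The right-tailed decision can be checked with the short sketch below. It assumes the sample mean of 60 km/litre stated in the problem, uses the sample standard deviation s in place of the unknown $\sigma$, and the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

ALPHA = 0.05
n, x_bar, s, mu0 = 100, 60.0, 3.0, 57.0  # values from the car example

# Test statistic with the sample standard deviation substituted for sigma
z0 = (x_bar - mu0) / (s / sqrt(n))

# One-sided (right-tail) critical value z_alpha
z_critical = NormalDist().inv_cdf(1 - ALPHA)

reject_h0 = z0 > z_critical

print(f"z0 = {z0:.2f}, critical value = {z_critical:.3f}, reject H0: {reject_h0}")
```

Only the critical value changes relative to a two-sided test: the whole 5% is placed in the right tail, giving 1.645 instead of 1.96.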


6.3 DIFFERENCE BETWEEN MEANS OF TWO 
POPULATIONS 


6.3.1 POPULATION VARIANCE KNOWN 
Steps: 


1. Let $\mu_X$ and $\sigma_X^2$ be respectively the mean and the variance of 
Population-1. Also, let $\mu_Y$ and $\sigma_Y^2$ be respectively the mean and 
the variance of Population-2 under study. Here $\sigma_X^2$ and $\sigma_Y^2$ are 
known admissible values. 

Frame the null hypothesis as $H_0$: $\mu_X = \mu_Y$ and choose the suitable 
alternative hypothesis from 

(i) $H_1$: $\mu_X \neq \mu_Y$  (ii) $H_1$: $\mu_X > \mu_Y$  (iii) $H_1$: $\mu_X < \mu_Y$ 

2. Let $(X_1, X_2, ..., X_m)$ be a random sample of m observations 
drawn from Population-1 and $(Y_1, Y_2, ..., Y_n)$ be a random 
sample of n observations drawn from Population-2, where m and 
n are large (i.e., $m \geq 30$ and $n \geq 30$). Further, these two samples 
are assumed to be independent. 

3. Specify the level of significance, $\alpha$. 


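4. Consider the test statistic 

$$Z = \frac{(\bar{X} - \bar{Y}) - (\mu_X - \mu_Y)}{\sqrt{\dfrac{\sigma_X^2}{m} + \dfrac{\sigma_Y^2}{n}}}$$ 

where $\bar{X}$ and $\bar{Y}$ are respectively the means of the two samples 
described in Step 2. 

5. The approximate sampling distribution of the test statistic 

$$Z = \frac{\bar{X} - \bar{Y}}{\sqrt{\dfrac{\sigma_X^2}{m} + \dfrac{\sigma_Y^2}{n}}}$$ 

under $H_0$ (i.e., $\mu_X = \mu_Y$) is the N(0, 1) distribution. 

It may be noted that, when $\sigma_X^2 = \sigma_Y^2 = \sigma^2$, the test statistic is 

$$Z = \frac{\bar{X} - \bar{Y}}{\sigma\sqrt{\dfrac{1}{m} + \dfrac{1}{n}}}$$ 

6. Calculate the value of Z for the given samples $(x_1, x_2, ..., x_m)$ and 
$(y_1, y_2, ..., y_n)$ as 

$$z_0 = \frac{\bar{x} - \bar{y}}{\sqrt{\dfrac{\sigma_X^2}{m} + \dfrac{\sigma_Y^2}{n}}}$$ 

Here, $\bar{x}$ and $\bar{y}$ are respectively the values of $\bar{X}$ and $\bar{Y}$ for the given 
samples. 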


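7. Find the critical value, $z_e$, corresponding to $\alpha$ and $H_1$ from the 
following table 

Alternative Hypothesis ($H_1$) | $\mu_X \neq \mu_Y$ | $\mu_X > \mu_Y$ | $\mu_X < \mu_Y$ 
Critical Value ($z_e$) | $z_{\alpha/2}$ | $z_\alpha$ | $z_\alpha$ 

8. Make the decision on $H_0$ by choosing the suitable rejection rule from the 
following table corresponding to $H_1$. 

Alternative Hypothesis ($H_1$) | $\mu_X \neq \mu_Y$ | $\mu_X > \mu_Y$ | $\mu_X < \mu_Y$ 
Rejection Rule | $\lvert z_0 \rvert \geq z_{\alpha/2}$ | $z_0 > z_\alpha$ | $z_0 < -z_\alpha$ 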
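
The two tables above can be turned into a small helper. This is an illustrative Python sketch only: the function names and the labels "two-sided", "greater" and "less" are assumptions made here, and the critical values come from the standard normal distribution in the standard library in place of a printed table.

```python
from statistics import NormalDist


def critical_value(alpha, alternative):
    """Critical value z_e for a large-sample Z test.

    alternative is "two-sided", "greater" or "less" (labels chosen here).
    """
    std_normal = NormalDist()  # the N(0, 1) distribution
    if alternative == "two-sided":
        return std_normal.inv_cdf(1 - alpha / 2)  # z_{alpha/2}
    return std_normal.inv_cdf(1 - alpha)          # z_alpha


def reject_h0(z0, alpha, alternative):
    """Apply the rejection rule that matches the alternative hypothesis."""
    z_e = critical_value(alpha, alternative)
    if alternative == "two-sided":
        return abs(z0) >= z_e
    if alternative == "greater":
        return z0 > z_e
    return z0 < -z_e


for alpha in (0.05, 0.01):
    print(alpha,
          round(critical_value(alpha, "two-sided"), 3),
          round(critical_value(alpha, "greater"), 3))
```

At the 5% level this gives the familiar values 1.96 (two-sided) and 1.645 (one-sided) used in the worked examples of this unit.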


Example: 
The performance of students in a national level Olympiad exam was 
studied. The scores secured by randomly selected students from two 
districts, viz., D1 and D2, of a State were analyzed. The numbers of students 
randomly selected from D1 and D2 are respectively 1000 and 1600. 
The average scores secured by the students selected from D1 and D2 are 
respectively 116 and 114. Can the samples be regarded as drawn from 
identical populations having common standard deviation 2? Test at the 5% 
level of significance. 


Solution: 


Step 1 : Let $\mu_X$ and $\mu_Y$ be respectively the mean scores secured in the 
national level Olympiad examination by all the students from 
the districts D1 and D2 considered for the study. It is given 
that the populations of the scores of the students of these 
districts have the common standard deviation $\sigma = 2$. The null 
and alternative hypotheses are 

Null hypothesis: $H_0$: $\mu_X = \mu_Y$ 

i.e., average scores secured by the students from the study 
districts are not significantly different. 

Alternative hypothesis: $H_1$: $\mu_X \neq \mu_Y$ 

i.e., average scores secured by the students from the study 
districts are significantly different. It is a two-sided alternative. 

Step 2 : Data 
The given sample information are 
Size of Sample-1 (m) = 1000 
Size of Sample-2 (n) = 1600. Hence, both the samples 
are large. 
Mean of Sample-1 ($\bar{x}$) = 116 
Mean of Sample-2 ($\bar{y}$) = 114 

Step 3 : Level of significance 
$\alpha = 5\%$ 


Step 4 : Test statistic 
The test statistic under the null hypothesis $H_0$ is 

$$Z = \frac{\bar{X} - \bar{Y}}{\sigma\sqrt{\dfrac{1}{m} + \dfrac{1}{n}}}$$ 

Since both m and n are large, the sampling distribution of 
Z under $H_0$ is the N(0, 1) distribution. 


Step 5 : Calculation of Test Statistic 
The value of Z is calculated for the given sample 
information from 

$$z_0 = \frac{\bar{x} - \bar{y}}{\sigma\sqrt{\dfrac{1}{m} + \dfrac{1}{n}}} = \frac{116 - 114}{2\sqrt{\dfrac{1}{1000} + \dfrac{1}{1600}}} = \frac{2}{2 \times 0.040311} = 24.807$$ 

Step 6 : Critical value 
Since $H_1$ is a two-sided alternative hypothesis, the critical value at 
$\alpha = 0.05$ is $z_e = z_{0.025} = 1.96$. 

Step 7 : Decision 
Since $H_1$ is a two-sided alternative, elements of the critical 
region are defined by the rejection rule $|z_0| \geq z_e = z_{0.025}$. For 
the given sample information, $|z_0| = 24.807 > z_e = 1.96$. It 
indicates that the given sample contains sufficient evidence 
to reject $H_0$. Thus, $H_0$ is rejected. 
Therefore, the average performances of the students in the 
districts D1 and D2 in the national level Olympiad 
examination are significantly different. Thus the given 
samples are not drawn from identical populations. 
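
As a cross-check of Step 5, the statistic can be evaluated directly from the Step 4 formula, with the common standard deviation $\sigma = 2$ kept in the denominator. This is a minimal Python sketch; the function name `two_sample_z_common_sigma` is hypothetical and introduced here only for illustration.

```python
import math


def two_sample_z_common_sigma(mean_x, mean_y, sigma, m, n):
    """z0 = (x_bar - y_bar) / (sigma * sqrt(1/m + 1/n)) for a known common sigma."""
    return (mean_x - mean_y) / (sigma * math.sqrt(1 / m + 1 / n))


# Values stated in the example: means 116 and 114, sigma = 2, m = 1000, n = 1600.
z0 = two_sample_z_common_sigma(116, 114, sigma=2, m=1000, n=1600)

z_crit = 1.96  # two-sided critical value at the 5% level, as quoted in Step 6
reject_h0 = abs(z0) >= z_crit
print(f"z0 = {z0:.3f}, reject H0: {reject_h0}")
```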


6.3.2 POPULATION VARIANCE UNKNOWN 


Steps: 


1. Let $\mu_X$ and $\sigma_X^2$ be respectively the mean and the variance of 
Population-1. Also, let $\mu_Y$ and $\sigma_Y^2$ be respectively the mean and 
the variance of Population-2 under study. Here $\sigma_X^2$ and $\sigma_Y^2$ are 
unknown. 

Frame the null hypothesis as $H_0$: $\mu_X = \mu_Y$ and choose the suitable 
alternative hypothesis from 

(i) $H_1$: $\mu_X \neq \mu_Y$  (ii) $H_1$: $\mu_X > \mu_Y$  (iii) $H_1$: $\mu_X < \mu_Y$ 

2. Let $(X_1, X_2, ..., X_m)$ be a random sample of m observations 
drawn from Population-1 and $(Y_1, Y_2, ..., Y_n)$ be a random 
sample of n observations drawn from Population-2, where m and 
n are large (i.e., $m \geq 30$ and $n \geq 30$). Further, these two samples 
are assumed to be independent. 

3. Specify the level of significance, $\alpha$. 


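4. Consider the test statistic 

$$Z = \frac{(\bar{X} - \bar{Y}) - (\mu_X - \mu_Y)}{\sqrt{\dfrac{S_X^2}{m} + \dfrac{S_Y^2}{n}}}$$ 

i.e., the above test statistic is obtained from the Z considered in 
the test described in Section 6.3.1 by substituting the sample variances 
$S_X^2$ and $S_Y^2$ respectively for $\sigma_X^2$ and $\sigma_Y^2$. 

5. The approximate sampling distribution of the test statistic 

$$Z = \frac{\bar{X} - \bar{Y}}{\sqrt{\dfrac{S_X^2}{m} + \dfrac{S_Y^2}{n}}}$$ 

under $H_0$ (i.e., $\mu_X = \mu_Y$) is the N(0, 1) distribution. 

6. Calculate the value of Z for the given samples $(x_1, x_2, ..., x_m)$ and 
$(y_1, y_2, ..., y_n)$ as 

$$z_0 = \frac{\bar{x} - \bar{y}}{\sqrt{\dfrac{s_X^2}{m} + \dfrac{s_Y^2}{n}}}$$ 

Here, $\bar{x}$ and $\bar{y}$ are respectively the values of $\bar{X}$ and $\bar{Y}$ for the given 
samples. 

Also, $s_X^2$ and $s_Y^2$ are respectively the values of $S_X^2$ and $S_Y^2$ for the 
given samples. 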


7. Find the critical value, $z_e$, corresponding to $\alpha$ and $H_1$ from the 
following table 

Alternative Hypothesis ($H_1$) | $\mu_X \neq \mu_Y$ | $\mu_X > \mu_Y$ | $\mu_X < \mu_Y$ 
Critical Value ($z_e$) | $z_{\alpha/2}$ | $z_\alpha$ | $z_\alpha$ 

8. Make the decision on $H_0$ by choosing the suitable rejection rule from the 
following table corresponding to $H_1$. 

Alternative Hypothesis ($H_1$) | $\mu_X \neq \mu_Y$ | $\mu_X > \mu_Y$ | $\mu_X < \mu_Y$ 
Rejection Rule | $\lvert z_0 \rvert \geq z_{\alpha/2}$ | $z_0 > z_\alpha$ | $z_0 < -z_\alpha$ 
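
The text gives no worked example for this case, so the sketch below uses hypothetical summary figures invented purely for illustration (they do not come from the source). It shows how the sample variances take the place of the population variances in the statistic; the function name `two_sample_z` is likewise an assumption.

```python
import math
from statistics import NormalDist


def two_sample_z(mean_x, mean_y, var_x, var_y, m, n):
    """z0 = (x_bar - y_bar) / sqrt(s_x^2/m + s_y^2/n), for large m and n."""
    return (mean_x - mean_y) / math.sqrt(var_x / m + var_y / n)


# Hypothetical summary data (not from the text): two large independent samples.
z0 = two_sample_z(mean_x=52.0, mean_y=50.0, var_x=16.0, var_y=25.0, m=100, n=100)

alpha = 0.05
z_e = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value z_{alpha/2}
reject_h0 = abs(z0) >= z_e
print(f"z0 = {z0:.4f}, critical value = {z_e:.3f}, reject H0: {reject_h0}")
```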


6.4 TEST OF HYPOTHESES FOR POPULATION 
PROPORTION 

Steps: 

1. Let P denote the proportion of the population possessing the 
qualitative characteristic (attribute) under study. If $p_0$ is an 
admissible value of P, then frame the null hypothesis as $H_0$: $P = p_0$ 
and choose the suitable alternative hypothesis from 

(i) $H_1$: $P \neq p_0$  (ii) $H_1$: $P > p_0$  (iii) $H_1$: $P < p_0$ 

2. Let p be the proportion of the sample observations possessing the 
attribute, where n is large, $np > 5$ and $n(1 - p) > 5$. 

3. Specify the level of significance, $\alpha$. 


4. Consider the test statistic 

$$Z = \frac{p - P}{\sqrt{\dfrac{PQ}{n}}}$$ 

under $H_0$. Here, $Q = 1 - P$. 

5. The approximate sampling distribution of the test statistic under 
$H_0$ is the N(0, 1) distribution. 

6. Calculate the value of Z under $H_0$ for the given data as 

$$z_0 = \frac{p - p_0}{\sqrt{\dfrac{p_0 q_0}{n}}}, \qquad q_0 = 1 - p_0.$$ 


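7. Choose the critical value, $z_e$, corresponding to $\alpha$ and $H_1$ from the 
following table 

Alternative Hypothesis ($H_1$) | $P \neq p_0$ | $P > p_0$ | $P < p_0$ 
Critical Value ($z_e$) | $z_{\alpha/2}$ | $z_\alpha$ | $z_\alpha$ 

8. Make the decision on $H_0$ by choosing the suitable rejection rule from the 
following table corresponding to $H_1$. 

Alternative Hypothesis ($H_1$) | $P \neq p_0$ | $P > p_0$ | $P < p_0$ 
Rejection Rule | $\lvert z_0 \rvert \geq z_{\alpha/2}$ | $z_0 > z_\alpha$ | $z_0 < -z_\alpha$ 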


Example: 


A survey was conducted among the students of a city to study 
their preference between chocolate and ice-cream. 
Among 2000 randomly selected students, it is found that 1120 prefer 
chocolate and the remaining prefer ice-cream. Can we conclude at the 1% level 
of significance from this information that chocolate and ice-cream 
are equally preferred among the students in the city? 

Solution: 
Step 1 : Let P denote the proportion of students in the city who 
prefer chocolate. Then, the null and the 
alternative hypotheses are 

Null hypothesis: $H_0$: $P = 0.5$ 

i.e., chocolate and ice-cream are preferred equally in the city. 

Alternative hypothesis: $H_1$: $P \neq 0.5$ 

i.e., chocolate and ice-cream are not preferred equally. 
It is a two-sided alternative hypothesis. 

Step 2 : Data 
The given sample information are 
Sample size (n) = 2000. Hence, it is a large sample. 
No. of chocolate consumers = 1120 
Sample proportion (p) = 1120/2000 = 0.56 

Step 3 : Level of significance 
$\alpha = 1\%$ 

Step 4 : Test statistic 
Since n is large, $np = 1120 > 5$ and $n(1 - p) = 880 > 5$, the 
test statistic under the null hypothesis is 

$$Z = \frac{p - P}{\sqrt{\dfrac{PQ}{n}}}$$ 

Its sampling distribution under $H_0$ is the N(0, 1) distribution. 

Step 5 : Calculation of Test Statistic 
The value of Z can be calculated for the sample information 
from 

$$z_0 = \frac{p - P}{\sqrt{\dfrac{PQ}{n}}} = \frac{0.56 - 0.50}{\sqrt{\dfrac{0.5 \times 0.5}{2000}}} = \frac{0.06}{0.01118} = 5.3666$$ 

Step 6 : Critical value 
Since $H_1$ is a two-sided alternative hypothesis, the critical 
value at the 1% level of significance is $z_{\alpha/2} = z_{0.005} = 2.58$. 

Step 7 : Decision 
Since $H_1$ is a two-sided alternative, elements of the critical 
region are determined by the rejection rule $|z_0| \geq z_e$. Thus it 
is a two-tailed test. Since $|z_0| = 5.3666 > z_e = 2.58$, reject $H_0$ 
at the 1% level of significance. Therefore, there is significant 
evidence to conclude that the preferences for chocolate and 
ice-cream are different. 
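
The calculation in this example can be reproduced with a few lines of Python. This is a sketch under the assumptions of the example ($P = 0.5$ under the null hypothesis); the function name `one_proportion_z` is introduced here only for illustration.

```python
import math
from statistics import NormalDist


def one_proportion_z(successes, n, p0):
    """z0 = (p - p0) / sqrt(p0 * q0 / n), with q0 = 1 - p0."""
    p = successes / n
    return (p - p0) / math.sqrt(p0 * (1 - p0) / n)


# Values stated in the example: 1120 of 2000 students prefer chocolate.
z0 = one_proportion_z(successes=1120, n=2000, p0=0.5)

alpha = 0.01
z_e = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value, about 2.58
reject_h0 = abs(z0) >= z_e
print(f"z0 = {z0:.4f}, critical value = {z_e:.2f}, reject H0: {reject_h0}")
```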


6.5 DIFFERENCE BETWEEN TWO PROPORTIONS 


Steps: 


1: Let $P_X$ and $P_Y$ denote respectively the proportions of Population-1 
and Population-2 possessing the qualitative characteristic 
(attribute) under study. Frame the null hypothesis as $H_0$: 
$P_X = P_Y$ and choose the suitable alternative hypothesis from 

(i) $H_1$: $P_X \neq P_Y$  (ii) $H_1$: $P_X > P_Y$  (iii) $H_1$: $P_X < P_Y$ 

2: Let $p_X$ and $p_Y$ denote respectively the proportions of the samples of 
sizes m and n drawn from Population-1 and Population-2 
possessing the attribute, where m and n are large (i.e., $m \geq 30$ 
and $n \geq 30$). Also, $m p_X > 5$, $m(1 - p_X) > 5$, $n p_Y > 5$ and 
$n(1 - p_Y) > 5$. Here, these two samples are assumed to be 
independent. 

3: Specify the level of significance, $\alpha$. 
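4: Consider the test statistic 

$$Z = \frac{(p_X - p_Y) - (P_X - P_Y)}{\sqrt{\hat{p}\hat{q}\left(\dfrac{1}{m} + \dfrac{1}{n}\right)}}$$ 

Here $\hat{p} = \dfrac{m p_X + n p_Y}{m + n}$ and $\hat{q} = 1 - \hat{p}$. The approximate sampling 
distribution of the test statistic 

$$Z = \frac{p_X - p_Y}{\sqrt{\hat{p}\hat{q}\left(\dfrac{1}{m} + \dfrac{1}{n}\right)}}$$ 

under $H_0$ is the N(0, 1) distribution. 

5: Calculate the value of Z for the given data as 

$$z_0 = \frac{p_X - p_Y}{\sqrt{\hat{p}\hat{q}\left(\dfrac{1}{m} + \dfrac{1}{n}\right)}}$$ 

6: Choose the critical value, $z_e$, corresponding to $\alpha$ and $H_1$ from the 
following table 

Alternative Hypothesis ($H_1$) | $P_X \neq P_Y$ | $P_X > P_Y$ | $P_X < P_Y$ 
Critical Value ($z_e$) | $z_{\alpha/2}$ | $z_\alpha$ | $z_\alpha$ 

7: Decide on $H_0$ by choosing the suitable rejection rule from the following 
table corresponding to $H_1$ 

Alternative Hypothesis ($H_1$) | $P_X \neq P_Y$ | $P_X > P_Y$ | $P_X < P_Y$ 
Rejection Rule | $\lvert z_0 \rvert \geq z_{\alpha/2}$ | $z_0 > z_\alpha$ | $z_0 < -z_\alpha$ 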


Example: 

A study was conducted to investigate the interest of students 
in private schools. Among 1000 randomly selected students 
from City-1, 800 were found to be private school students. 
From City-2, 1600 students were selected randomly and 
among them 1200 are from private schools. Do the 
data indicate that the two cities are significantly different 
with respect to the prevalence of private schooling among the 
students? Choose the level of significance as $\alpha = 0.05$. 

Solution: 

Step 1 : Let $P_X$ and $P_Y$ be respectively the proportions of private school 
students in City-1 and City-2. Then, the null and alternative 
hypotheses are 

Null hypothesis: $H_0$: $P_X = P_Y$ 

i.e., there is no significant difference between the 
proportions of private school students in City-1 and City-2. 

Alternative hypothesis: $H_1$: $P_X \neq P_Y$ 

i.e., the difference between the proportions of private school 
students in City-1 and City-2 is significant. It is a two-sided 
alternative hypothesis. 


Step 2 : Data 

The given sample information are 

City | Sample size | Sample proportion 
City 1 | m = 1000 | $p_X$ = 800/1000 = 0.80 
City 2 | n = 1600 | $p_Y$ = 1200/1600 = 0.75 

Here m > 30 and n > 30, $m p_X = 800 > 5$, 
$m(1 - p_X) = 200 > 5$, $n p_Y = 1200 > 5$ and $n(1 - p_Y) = 400 > 5$. 

Step 3 : Level of significance 

$\alpha = 5\%$ 


Step 4 : Test statistic 
The test statistic under the null hypothesis is 

$$Z = \frac{(p_X - p_Y) - (P_X - P_Y)}{\sqrt{\hat{p}\hat{q}\left(\dfrac{1}{m} + \dfrac{1}{n}\right)}}$$ 

where $\hat{p} = \dfrac{m p_X + n p_Y}{m + n}$ and $\hat{q} = 1 - \hat{p}$. 

Step 5 : Calculation of Test Statistic 

Here $\hat{p} = \dfrac{800 + 1200}{1000 + 1600} = 0.7692$ and $\hat{q} = 1 - 0.7692 = 0.2308$. 
The value of Z for the given sample information is calculated 
from 

$$z_0 = \frac{p_X - p_Y}{\sqrt{\hat{p}\hat{q}\left(\dfrac{1}{m} + \dfrac{1}{n}\right)}} = \frac{0.80 - 0.75}{\sqrt{0.7692 \times 0.2308 \times \left(\dfrac{1}{1000} + \dfrac{1}{1600}\right)}} = \frac{0.05}{0.016984} = 2.944$$ 

Step 6 : Critical value 

Since $H_1$ is a two-sided alternative hypothesis, the critical 
value at the 5% level of significance is $z_e = 1.96$. 

Step 7 : Decision 

Since $H_1$ is a two-sided alternative, elements of the critical 
region are determined by the rejection rule $|z_0| \geq z_e$. Thus, it 
is a two-tailed test. For the given sample information, $|z_0| = 
2.944 > z_e = 1.96$. Hence, $H_0$ is rejected. We can conclude 
that the difference between the proportions of private school 
students in City-1 and City-2 is significant. 
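
The pooled-proportion statistic of this example can be evaluated directly from the counts. The following Python sketch is illustrative only; the function name `two_proportion_z` is an assumption, and the computation works from the raw counts without intermediate rounding.

```python
import math


def two_proportion_z(x_count, m, y_count, n):
    """z0 = (p_x - p_y) / sqrt(p_hat * q_hat * (1/m + 1/n)) with pooled p_hat."""
    p_x = x_count / m
    p_y = y_count / n
    p_hat = (x_count + y_count) / (m + n)  # same as (m*p_x + n*p_y) / (m + n)
    q_hat = 1 - p_hat
    standard_error = math.sqrt(p_hat * q_hat * (1 / m + 1 / n))
    return (p_x - p_y) / standard_error


# Counts stated in the example: 800 of 1000 in City-1, 1200 of 1600 in City-2.
z0 = two_proportion_z(800, 1000, 1200, 1600)

z_crit = 1.96  # two-sided critical value at the 5% level, as quoted in Step 6
reject_h0 = abs(z0) >= z_crit
print(f"z0 = {z0:.4f}, reject H0: {reject_h0}")
```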


CHECK YOUR PROGRESS - 1 
1. What is hypothesis testing? 
2. Large sample theory is applicable when ________. 
3. What is the standard error of the sample proportion under $H_0$? 

6.6 SUMMARY 


• If the number of sample observations is greater than or equal to 
30, it is called a large sample. 

• For hypothesis testing based on two samples, if the sizes of both 
the samples are greater than or equal to 30, they are called large 
samples. 

• For testing a population proportion, the sampling distribution of the 
test statistic is N(0, 1) only when $n \geq 30$, $np > 5$ and $n(1 - p) > 5$. 

• For testing equality of two population proportions, the sampling 
distribution of the test statistic is N(0, 1) only when $m \geq 30$, 
$n \geq 30$, $m p_X > 5$, $m(1 - p_X) > 5$, $n p_Y > 5$ and $n(1 - p_Y) > 5$. 


6.7 KEY WORDS 

Hypothesis, Population, Proportion 


6.8 ANSWER TO CHECK YOUR PROGRESS 


1. Hypothesis testing is a statistical method that is used in making 
statistical decisions using experimental data. 
2. When $n \geq 30$ 
3. $\sqrt{\dfrac{PQ}{n}}$, where $Q = 1 - P$ 


6.9 QUESTION AND EXERCISE 


SHORT ANSWER QUESTIONS 


1. List the possible alternative hypotheses and the corresponding 
rejection rules followed in testing equality of two population 
means. 

2. Specify the alternative hypotheses and the rejection rules 
prescribed for testing equality of two population proportions 

LONG ANSWER QUESTIONS 

1. Explain the general procedure to be followed for testing of 
hypotheses. 

2. Explain the procedure for testing hypotheses for population mean, 
when the population variance is unknown. 

3. How will you formulate the hypotheses for testing equality of 
means of two populations, when the population variances are 
known? Describe the method. 

4. Describe the procedure for testing hypotheses concerning equality 
of means of two populations, assuming that the population 
variances are unknown. 

5. Give a detailed account on testing hypotheses for population 
proportion. 

6. Explain the procedure of testing hypotheses for equality of 
proportions of two populations. 

7. Interest of XII students in residential schooling was investigated 
among randomly selected students from two regions. Among 300 
students selected from Region A, 34 students expressed their 
interest. Among 200 students selected from Region B, 28 students 
expressed their interest. Does this information provide sufficient 
evidence to conclude at the 5% level of significance that students in 
Region A are more interested in residential schooling than the 
students in Region B? 




UNIT 7 - CHI-SQUARE TEST 

Structure 
7.0 Introduction 
7.1 Objectives 
7.2 Characteristics of Chi-Square Test 
7.3 Uses of Chi-Square Test 
7.4 Steps of Chi-Square Test 
7.5 Summary 
7.6 Questions 

7.0 INTRODUCTION 


A chi-squared test, also written as $\chi^2$ test, is any statistical 
hypothesis test where the sampling distribution of the test statistic is a 
chi-squared distribution when the null hypothesis is true. Without other 
qualification, 'chi-squared test' is often used as short for Pearson's 
chi-squared test. 

The chi-squared test is used to determine whether there is a 
significant difference between the expected frequencies and the observed 
frequencies in one or more categories. 

The chi-square test is applied in statistics to test goodness of fit, that 
is, to verify whether the distribution of observed data agrees with an 
assumed theoretical distribution. It is therefore a measure of the 
divergence between actual and expected frequencies. It is of great use in 
statistics, especially in sampling studies, where we expect some 
disagreement between actual and expected frequencies and need to judge 
the extent to which the difference can be ignored as arising from 
fluctuations in sampling. 


7.1 OBJECTIVES 


The student will be able to 
• Understand the purpose of using the chi-square test 
• Understand the procedures for analysis of variance 
• Understand the characteristics and uses of the chi-square test 
• Solve problems to test the hypothesis that the population 
has a particular variance using the chi-square test 


7.2 CHARACTERISTICS OF CHI-SQUARE TEST 

1. The test is based on events or frequencies, whereas tests for 
theoretical distributions are based on the mean and standard deviation. 

2. The test is applied to draw inferences, especially for testing 
hypotheses, but it is not useful for estimation. 

3. The test can be used between the entire set of observed and 
expected frequencies. 

4. For every increase in the number of degrees of freedom, a new $\chi^2$ 
distribution is formed. 

5. It is a general purpose test and as such is highly useful in research. 

7.3 USES OF CHI-SQUARE TEST 

$\chi^2$ test of goodness of fit 

Through the test we can find out the deviations between the 
observed values and expected values. Here we are not concerned with the 
parameters but with the form of the distribution. Karl Pearson 
developed a method to test the difference between the theoretical value 
(hypothesis) and the observed value. The Greek letter $\chi^2$ is used to describe 
the magnitude of the difference between fact and theory. 
$\chi^2$ may be defined as 

$$\chi^2 = \sum \frac{(O - E)^2}{E}$$ 

where 
O = Observed frequencies 
E = Expected frequencies 


7.4 STEPS OF CHI-SQUARE TEST 

1. A hypothesis is established along with the significance level. 

2. Compute the deviation between the observed value and the expected value, 
(O - E). 

3. Square the deviations calculated, $(O - E)^2$. 

4. Divide each $(O - E)^2$ by its expected frequency. 

5. Add all the values obtained in step 4. 

6. Find the table value of $\chi^2$ at the chosen level of significance, usually 
the 5% level. 

If the calculated value of $\chi^2$ is greater than the table value of $\chi^2$ at 
the chosen level of significance, we reject the hypothesis. If the 
computed value of $\chi^2$ is less than the table value at the chosen 
level of significance, it is said to be non-significant. This implies that 
the discrepancy between the observed and expected frequencies may be due to 
fluctuations in simple sampling. 
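
Steps 2 to 6 translate almost line by line into code. The sketch below is illustrative: the die-rolling frequencies are hypothetical figures invented for the demonstration, the function name is an assumption, and the table value 11.070 is the standard 5% point of the $\chi^2$ distribution with 5 degrees of freedom (6 categories minus 1).

```python
def chi_square_statistic(observed, expected):
    """Sum of (O - E)^2 / E over all categories (steps 2 to 5)."""
    total = 0.0
    for o, e in zip(observed, expected):
        deviation = o - e             # step 2: deviation O - E
        total += deviation ** 2 / e   # steps 3 and 4: square, divide by E
    return total                      # step 5: sum over categories


# Hypothetical goodness-of-fit data: a die rolled 60 times, 10 expected per face.
observed = [8, 12, 9, 11, 13, 7]
expected = [10] * 6

chi2 = chi_square_statistic(observed, expected)

TABLE_VALUE = 11.070  # chi-square table, 5% level, 5 degrees of freedom
significant = chi2 > TABLE_VALUE  # step 6: compare with the table value
print(f"chi-square = {chi2:.3f}, significant at 5%: {significant}")
```

Here the computed value is well below the table value, so the discrepancy would be treated as a sampling fluctuation.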

Example: 

In a certain sample of 2000 families, 1400 families are consumers of tea. 
Out of 1800 Hindu families, 1236 families consume tea. Use the $\chi^2$ test and 
state whether there is any significant difference between consumption of 
tea among Hindu and non-Hindu families. 


Solution 

Null hypothesis $H_0$: there is no significant difference in the consumption 
of tea between Hindu and non-Hindu families. 

On tabulation of the information in a 2x2 contingency table, we 
get: 

Observed Frequencies 
 | Hindu | Non-Hindu | Total 
Consuming Tea | 1236 | 164 | 1400 
Non-Consuming Tea | 564 | 36 | 600 
Total | 1800 | 200 | 2000 

Each expected frequency is (row total x column total) / grand total; 
for example, 1400 x 1800 / 2000 = 1260. 

Expected Frequencies 
 | Hindu | Non-Hindu | Total 
Consuming Tea | 1260 | 140 | 1400 
Non-Consuming Tea | 540 | 60 | 600 
Total | 1800 | 200 | 2000 

Calculation of $\chi^2$ 
O | E | O - E | $(O - E)^2$ | $(O - E)^2/E$ 
1236 | 1260 | -24 | 576 | 0.457 
564 | 540 | 24 | 576 | 1.067 
164 | 140 | 24 | 576 | 4.114 
36 | 60 | -24 | 576 | 9.600 
 | | | $\sum (O - E)^2/E$ | 15.238 

For a 2x2 contingency table, the degrees of freedom are 
$\nu = (c - 1)(r - 1) = (2 - 1)(2 - 1) = 1$. 

The table value of $\chi^2$ at the 0.05 level for 1 d.f is 3.841. 

The calculated value of $\chi^2$, 15.238, is higher than the table value, 3.841; 
therefore the null hypothesis is rejected. 

Hence, the two communities differ significantly as far as 
consumption of tea is concerned. 
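
The expected frequencies and the $\chi^2$ value of this example can be reproduced from the observed table alone. This Python sketch is illustrative; the helper name `expected_frequencies` is an assumption, and the table value 3.841 is the one quoted in the text for 1 degree of freedom.

```python
def expected_frequencies(table):
    """Expected cell = row total * column total / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


# Observed frequencies: rows = tea / no tea, columns = Hindu / non-Hindu.
observed = [[1236, 164],
            [564, 36]]

expected = expected_frequencies(observed)

chi2 = sum((o - e) ** 2 / e
           for obs_row, exp_row in zip(observed, expected)
           for o, e in zip(obs_row, exp_row))

dof = (len(observed) - 1) * (len(observed[0]) - 1)  # (r - 1)(c - 1)
TABLE_VALUE = 3.841  # chi-square table, 5% level, 1 degree of freedom
reject_h0 = chi2 > TABLE_VALUE
print(f"expected = {expected}, chi-square = {chi2:.3f}, d.f = {dof}, reject H0: {reject_h0}")
```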


7.5 SUMMARY 

• The $\chi^2$ distribution is used for testing the specified variance of a 
normal population, testing goodness of fit and testing independence of 
attributes. 

• Through the test we can find out the deviations between the observed 
values and expected values. Here we are not concerned with the 
parameters but with the form of the distribution. 


7.6 QUESTIONS AND EXERCISE 

SHORT ANSWER QUESTIONS: 
1. Define the chi-square test. 
2. What are the conditions for the validity of the chi-square test? 
3. Give five applications of the chi-square test. 
4. What are the uses of the chi-square test? 

LONG ANSWER QUESTIONS: 
1. Explain the steps of the chi-square test. 
2. Write down the steps for testing the significance of goodness of fit. 

UNIT VIII - F-TEST 


Structure 
8.1 Introduction 
8.2 Analysis of Variance (ANOVA) 
8.3 Assumptions in Analysis of Variance 
8.4 Basic steps in Analysis of Variance 
8.4.1 One Way ANOVA 
8.4.2 Two Way ANOVA 
8.5 Summary 
8.6 Key Words 
8.7 Answer to Check Your Progress 
8.8 Questions and Exercise 


8.9 Further Readings 


8.1 INTRODUCTION 

An F-test is any statistical test in which the test statistic has an 
F-distribution under the null hypothesis. It is most often used when 
comparing statistical models that have been fitted to a data set, in order to 
identify the model that best fits the population from which the data were 
sampled. 


8.2 ANALYSIS OF VARIANCE (ANOVA) 


Analysis of variance (ANOVA) is a collection of statistical models 
and their associated estimation procedures (such as the "variation" among 
and between groups) used to analyze the differences among group means 
in a sample. ANOVA was developed by statistician and evolutionary 
biologist Ronald Fisher. The ANOVA is based on the law of total 
variance, where the observed variance in a particular variable is 
partitioned into components attributable to different sources of variation. 
In its simplest form, ANOVA provides a statistical test of whether two or 
more population means are equal, and therefore generalizes the t-test 
beyond two means. 


8.3 ASSUMPTIONS IN ANALYSIS OF VARIANCE 


1. Each of the samples is drawn from a normal distribution. 
2. The variances of the populations from which the samples have been 
drawn are equal. 
3. The variation of each value around its own grand mean should be 
independent for each value. 

8.4 BASIC STEPS IN ANALYSIS OF VARIANCE 

1. Determine one estimate of the population variance from the variance 
among the sample means. 
2. Determine a second estimate of the population variance from the 
variance within the samples. 
3. Compare these two estimates. If they are approximately equal in 
value, accept the null hypothesis. 


8.4.1 ONE-WAY ANOVA 

In statistics, one-way analysis of variance (abbreviated one-way 
ANOVA) is a technique that can be used to compare means of two or 
more samples (using the F distribution). This technique can be used only 
for numerical response data, the "Y", usually one variable, and numerical 
or (usually) categorical input data, the "X", always one variable, hence 
"one-way". 

The ANOVA tests the null hypothesis that samples in all groups are 
drawn from populations with the same mean values. To do this, two 
estimates are made of the population variance. These estimates rely on 
various assumptions. 

The ANOVA produces an F-statistic, the ratio of the variance 
calculated among the means to the variance within the samples. When 
there are only two means to compare, the t-test and the F-test are 
equivalent; the relation between ANOVA and t is given by $F = t^2$. An 
extension of one-way ANOVA is two-way analysis of variance, which 
examines the influence of two different categorical independent variables 
on one dependent variable. 
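
The relation $F = t^2$ for two groups can be verified numerically. The sketch below uses two small hypothetical samples invented for the demonstration, and the function names are assumptions; the t statistic is the equal-variance (pooled) two-sample t, which is the version that matches the one-way ANOVA F.

```python
import math
from statistics import fmean


def pooled_t(a, b):
    """Two-sample t statistic with pooled (equal) variance."""
    mean_a, mean_b = fmean(a), fmean(b)
    ss_a = sum((x - mean_a) ** 2 for x in a)
    ss_b = sum((x - mean_b) ** 2 for x in b)
    pooled_var = (ss_a + ss_b) / (len(a) + len(b) - 2)
    return (mean_a - mean_b) / math.sqrt(pooled_var * (1 / len(a) + 1 / len(b)))


def one_way_f(groups):
    """F = (between-group mean square) / (within-group mean square)."""
    values = [x for g in groups for x in g]
    grand_mean = fmean(values)
    ss_between = sum(len(g) * (fmean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((x - fmean(g)) ** 2 for g in groups for x in g)
    ms_between = ss_between / (len(groups) - 1)
    ms_within = ss_within / (len(values) - len(groups))
    return ms_between / ms_within


# Hypothetical samples, for illustration only.
group_a = [12, 15, 11, 14]
group_b = [18, 17, 20, 16, 19]

t = pooled_t(group_a, group_b)
f = one_way_f([group_a, group_b])
print(f"t^2 = {t ** 2:.6f}, F = {f:.6f}")
```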

Example: 

In order to determine whether there is a significant difference in the 
durability of 3 makes of computers, samples of size 5 are selected from 
each make and the frequency of repair during the 1st year of purchase is 
observed. The results are as follows: 

Makes 
I | II | III 
4 | 7 | 6 
6 | 9 | 4 
8 | 11 | 6 
9 | 12 | 3 
7 | 5 | 2 

In view of the above data, what conclusion can you draw? 


Solution: 
Null hypothesis $H_0$: there is no significant difference in the durability of 
the 3 makes of computers. 

Computer I | | Computer II | | Computer III | 
$X_1$ | $X_1^2$ | $X_2$ | $X_2^2$ | $X_3$ | $X_3^2$ 
4 | 16 | 7 | 49 | 6 | 36 
6 | 36 | 9 | 81 | 4 | 16 
8 | 64 | 11 | 121 | 6 | 36 
9 | 81 | 12 | 144 | 3 | 9 
7 | 49 | 5 | 25 | 2 | 4 
$\sum X_1 = 34$ | $\sum X_1^2 = 246$ | $\sum X_2 = 44$ | $\sum X_2^2 = 420$ | $\sum X_3 = 21$ | $\sum X_3^2 = 101$ 

Step 1 
Sum of all items (T) = $\sum X_1 + \sum X_2 + \sum X_3$ 
= 34 + 44 + 21 
= 99 

Step 2 
Correction factor (C.F) = $\dfrac{T^2}{N} = \dfrac{(99)^2}{15} = 653.4$ 

Step 3 
TSS = Sum of squares of all the items - C.F 
= $\sum X_1^2 + \sum X_2^2 + \sum X_3^2 - \dfrac{T^2}{N}$ 
= 246 + 420 + 101 - 653.4 = 113.6 

Step 4 
SSC = Sum of squares between samples 
= $\dfrac{(\sum X_1)^2}{n_1} + \dfrac{(\sum X_2)^2}{n_2} + \dfrac{(\sum X_3)^2}{n_3} - \text{C.F}$ 
= $\dfrac{(34)^2}{5} + \dfrac{(44)^2}{5} + \dfrac{(21)^2}{5} - 653.4$ 
= 231.2 + 387.2 + 88.2 - 653.4 = 53.2 

Step 5 
MSC = Sum of squares between samples / d.f 
= 53.2 / 2 
= 26.6 

Step 6 
SSE = Total sum of squares - Sum of squares between samples 
= 113.6 - 53.2 
= 60.4 

Step 7 
MSE = Sum of squares within samples / d.f 
= 60.4 / 12 
= 5.03 

ANOVA TABLE 
Source of variation | Sum of squares | Degrees of freedom | Mean squares | F-ratio 
Between samples | SSC = 53.2 | 3 - 1 = 2 | MSC = SSC/d.f = 26.6 | $F_c$ = MSC/MSE = 5.28 
Within samples | SSE = 60.4 | 15 - 3 = 12 | MSE = SSE/d.f = 5.03 | 

The tabulated value of F for $\nu_1 = 2$ and $\nu_2 = 12$ at the 5% level of 
significance is 3.89, i.e., $F_{tab} = 3.89$. The calculated value of F is $F_c = 5.28$. Since 
$F_c > F_{tab}$, we reject the null hypothesis $H_0$. There is a significant difference 
in the durability of the 3 makes of computers. 
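
The seven steps of the one-way ANOVA can be checked from the raw data. The sketch below follows the correction-factor method of the solution and computes every quantity from the observations without intermediate rounding; variable names are chosen here for illustration, and the table value 3.89 is the standard 5% point of F with (2, 12) degrees of freedom.

```python
# Frequency of repair in the first year for each make (data from the example).
makes = {
    "I": [4, 6, 8, 9, 7],
    "II": [7, 9, 11, 12, 5],
    "III": [6, 4, 6, 3, 2],
}

k = len(makes)                                      # number of samples
N = sum(len(values) for values in makes.values())   # total number of items
T = sum(sum(values) for values in makes.values())   # step 1: sum of all items

correction_factor = T ** 2 / N                      # step 2
tss = sum(x ** 2 for values in makes.values() for x in values) - correction_factor
ssc = sum(sum(values) ** 2 / len(values) for values in makes.values()) - correction_factor
sse = tss - ssc                                     # step 6

msc = ssc / (k - 1)                                 # step 5: d.f = k - 1
mse = sse / (N - k)                                 # step 7: d.f = N - k
f_ratio = msc / mse

F_TABLE = 3.89  # F table, 5% level, (2, 12) degrees of freedom
reject_h0 = f_ratio > F_TABLE
print(f"TSS = {tss:.1f}, SSC = {ssc:.1f}, SSE = {sse:.1f}, F = {f_ratio:.2f}, reject H0: {reject_h0}")
```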

8.4.2 TWO-WAY ANOVA 

The two-way ANOVA compares the mean differences between 
groups that have been split on two independent variables (called factors). 
The primary purpose of a two-way ANOVA is to understand if there is 
an interaction between the two independent variables on the dependent 
variable. 

The term "randomized block design" stems from agricultural research, in 
which several variables or treatments are applied to different blocks of land 
for repetition or replication of the experimental effects. The advantages of a 
completely randomized experimental design are as follows. 

(a) Easy to lay out. 

(b) Allows flexibility. 

(c) Simple statistical analysis. 

(d) The loss of information due to missing data is smaller than with 
any other design. 

But this design is usually suited (i) only for a small number of treatments 
and (ii) for homogeneous experimental material. 
Example: 
Three varieties A, B, C of a crop are tested in a randomized block design 
with four replications. The plot yields in pounds are as follows 

Varieties | Yields 
 | 1 | 2 | 3 | 4 
A | 6 | 4 | 8 | 6 
B | 7 | 6 | 6 | 9 
C | 8 | 5 | 10 | 9 

Solution 
Null hypothesis $H_0$: There is no significant difference between varieties 
(rows) and between yields (blocks). 

Varieties | Yields 
 | 1 | 2 | 3 | 4 | Total 
A | 6 | 4 | 8 | 6 | 24 
B | 7 | 6 | 6 | 9 | 28 
C | 8 | 5 | 10 | 9 | 32 
Total | 21 | 15 | 24 | 24 | 84 
Step 1 
Grand total (T) = 84 

Step 2 
Correction factor (C.F) = $\dfrac{T^2}{N} = \dfrac{(84)^2}{12} = 588$ 

Step 3 
SSC = Sum of squares between blocks (columns) 
= $\dfrac{(21)^2}{3} + \dfrac{(15)^2}{3} + \dfrac{(24)^2}{3} + \dfrac{(24)^2}{3} - \text{C.F}$ 
= 606 - 588 
= 18 

Step 4 
SSR = Sum of squares between varieties (rows) 
= $\dfrac{(24)^2}{4} + \dfrac{(28)^2}{4} + \dfrac{(32)^2}{4} - \text{C.F}$ 
= 596 - 588 
= 8 

Step 5 
TSS = Sum of squares of all the items - C.F 
= $[6^2 + 7^2 + 8^2 + 4^2 + 6^2 + 5^2 + 8^2 + 6^2 + 10^2 + 6^2 + 9^2 + 9^2] - 588$ 
= 624 - 588 
= 36 

Step 6 
SSE = Residual sum of squares 
= TSS - (SSC + SSR) 
= 36 - (18 + 8) = 10 

Step 7 
Residual d.f = (c - 1)(r - 1) 
= (3)(2) 
= 6 

ANOVA TABLE 
Source of variation | Sum of squares | Degrees of freedom | Mean squares | F-ratio 
Between blocks (columns) | SSC = 18 | c - 1 = 4 - 1 = 3 | MSC = SSC/d.f = 6 | $F_c$ = MSC/MSE = 3.6 
Between varieties (rows) | SSR = 8 | r - 1 = 3 - 1 = 2 | MSR = SSR/d.f = 4 | $F_r$ = MSR/MSE = 2.4 
Residual | SSE = 10 | (r - 1)(c - 1) = 6 | MSE = SSE/d.f = 1.667 | 


(i) The tabulated value of F for (3, 6) d.f at the 5% level of significance 
is 4.76, i.e., $F_{tab} = 4.76$. Since $F_c = 3.6 < F_{tab}$, we accept the null 
hypothesis $H_0$. That is, there is no significant difference between yields 
(blocks). 

(ii) The tabulated value of F for (2, 6) d.f at the 5% level of significance 
is 5.14, i.e., $F_{tab} = 5.14$. Since $F_r = 2.4 < F_{tab}$, we accept the null 
hypothesis $H_0$. That is, there is no significant difference between varieties. 
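
The two-way computation can be checked in the same way. In this Python sketch the cell values are arranged in the order in which they appear in the total sum of squares of Step 5 (read block by block), which is the arrangement that reproduces the row totals 24, 28, 32 and the column totals 21, 15, 24, 24; variable names are illustrative, and the two table values are those quoted in the text.

```python
# Yields for varieties A, B, C (rows) over four blocks (columns).
yields = [
    [6, 4, 8, 6],    # variety A, total 24
    [7, 6, 6, 9],    # variety B, total 28
    [8, 5, 10, 9],   # variety C, total 32
]

r = len(yields)       # number of varieties (rows)
c = len(yields[0])    # number of blocks (columns)

T = sum(sum(row) for row in yields)                          # grand total
correction_factor = T ** 2 / (r * c)

ssc = sum(sum(col) ** 2 for col in zip(*yields)) / r - correction_factor  # blocks
ssr = sum(sum(row) ** 2 for row in yields) / c - correction_factor        # varieties
tss = sum(x ** 2 for row in yields for x in row) - correction_factor
sse = tss - (ssc + ssr)                                      # residual

msc = ssc / (c - 1)
msr = ssr / (r - 1)
mse = sse / ((r - 1) * (c - 1))

f_cols = msc / mse   # F for blocks, compared with F(3, 6) = 4.76
f_rows = msr / mse   # F for varieties, compared with F(2, 6) = 5.14

print(f"SSC = {ssc:.0f}, SSR = {ssr:.0f}, SSE = {sse:.0f}")
print(f"F blocks = {f_cols:.2f} (< 4.76: {f_cols < 4.76}), "
      f"F varieties = {f_rows:.2f} (< 5.14: {f_rows < 5.14})")
```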
CHECK YOUR PROGRESS - 1 

1. By which other name is the chi-square goodness of fit test 
known? 
2. What type of data do you need in a chi-square test? 
3. What symbol is used to represent chi-square? 
4. What is analysis of variance? 
5. What is the main purpose of the two-way ANOVA test? 
6. The variation of each value around its own grand mean should be 
________ for each value. 

8.5 SUMMARY 

• The $\chi^2$ distribution is used for testing the specified variance of a 
normal population, testing goodness of fit and testing independence of 
attributes. 

• Analysis of variance (ANOVA) is a collection of statistical 
models and their associated estimation procedures used to analyze 
the differences among group means in a sample. 

• One-way analysis of variance (abbreviated one-way ANOVA) is a 
technique that can be used to compare the means of two or more samples. 

• The two-way ANOVA compares the mean differences between 
groups that have been split on two independent variables. 


8.6 KEY WORDS 


Chi-square, Analysis of Variance, One way method, Two way method 


8.7 ANSWER TO CHECK YOUR PROGRESS 


1. One-sample chi-square test 
2. Categorical 
3. $\chi^2$ 
4. ANOVA provides a statistical test of whether two or more 
population means are equal, and therefore generalizes the 
t-test beyond two means. 
5. The primary purpose of a two-way ANOVA is to understand if 
there is an interaction between the two independent variables 
on the dependent variable. 
6. Independent 


8.8 QUESTIONS AND EXERCISE 


SHORT ANSWER QUESTIONS: 
1. Define the chi-square test. 
2. What are the conditions for the validity of the chi-square test? 
3. Give five applications of the chi-square test. 
4. What is analysis of variance? 
5. What are the assumptions of ANOVA? 
LONG ANSWER QUESTIONS: 
1. Explain the steps of the chi-square test. 
2. Write down the steps for testing the significance of goodness of fit. 
3. Write the model ANOVA table for one-way classification. 
4. Compare one-way and two-way ANOVA. 

UNIT IX - CORRELATION ANALYSES 


Structure 
9.0 Introduction 
9.1 Objectives 
9.2 Correlation 
9.3 Linear Correlation 
9.4 Types of Correlation 
9.5 Scatter Diagram 
9.6 Two-Way Table
9.7 Pearson’s Correlation Coefficient 
9.8 Spearman’s rank Correlation Coefficient 
9.9 Properties of Correlation Coefficient 
9.10 Summary 
9.11 Key Words 
9.12 Answer to Check Your Progress 
9.13 Questions and Exercise 
9.14 Further Readings 


9.0 INTRODUCTION 


In our day-to-day life, we find many situations in which a mutual
relationship exists between two variables, i.e., with a change (fall or
rise) in the value of one variable there may be a change (fall or rise) in
the value of the other variable. For example, as the price of a commodity
increases, the demand for the commodity decreases. Similarly, as the
pressure increases, the volume of a gas decreases at a constant
temperature. These facts indicate that there is some mutual relationship
between the demand for a commodity and its price, and between pressure
and volume. Such association is studied in correlation analysis.
Correlation is a statistical tool which measures the degree or intensity
or extent of relationship between two variables, and correlation analysis
involves various methods and techniques used for studying and
measuring the extent of the relationship between the two variables.


9.1 OBJECTIVES 


After studying this chapter, students will be able to understand


• The concept of the scatter diagram
• The concept of Karl Pearson's correlation co-efficient and the
methods of computing it
• Spearman's rank correlation co-efficient


9.2 CORRELATION 


Correlation is a statistical technique which measures and analyses the
degree or extent to which two or more variables fluctuate with reference
to one another. It denotes the inter-dependence amongst variables. The
degree is expressed by a coefficient which ranges from -1 to +1.
The direction of change is indicated by the + or - sign; the former refers
to movement in the same direction and the latter to movement in the
opposite direction. An absence of correlation is indicated by zero.
Correlation thus expresses the relationship through a relative measure of
change and it has nothing to do with the units in which the variables are
expressed.


9.3 LINEAR CORRELATION 


If the amount of change in one variable tends to bear a constant ratio
to the amount of change in the other variable, then the correlation is said
to be linear. For example,

X    5    10    15    20    25
Y    90   170   230   310   420


9.4 TYPES OF CORRELATION 


There are three important types of correlation. They are 


1. Positive and Negative correlation 
2. Simple, Partial and Multiple correlation 
3. Linear and Non-Linear correlation 


1. Positive and Negative correlation 


Correlation is classified according to the direction of change in 
the two variables. In this regard, the correlation may either be positive or 
negative. 


Positive correlation refers to the movement of variables in the same
direction. If both variables increase together or decrease together, the
correlation is said to be positive. It is otherwise called direct
correlation. For example, a positive correlation exists between ages
of husband and wife, height and weight of a group of individuals,
increase in rainfall and production of paddy, increase in the offer and
sales.


Negative correlation refers to the movement of variables in the
opposite direction. In other words, if an increase (decrease) in the
value of one variable is accompanied by a decrease (increase) in the
value of the other, the correlation is said to be negative. It is otherwise
called inverse correlation. For example, a negative correlation exists
between price and demand, yield of crop and price.


The following examples illustrate the concepts of positive correlation
and negative correlation.

Positive correlation
X    5    7    9    11    16    20    28
Y    20   26   35   37    48    50    55

Negative correlation
X    14   17   23   35    46
Y    16   12   10   9     5


2. Simple, Partial and Multiple Correlation


Simple correlation is a measure used to determine the strength
and the direction of the relationship between two variables, X and Y. A
simple correlation coefficient can range from -1 to +1. However, the
maximum (or minimum) values of some simple correlations cannot reach
unity (i.e., +1 or -1).


When we study only two variables, the relationship is described
as simple correlation; for example, quantity of money and price level,
demand and price, etc. In multiple correlation we study more than
two variables simultaneously; for example, the relationship of price,
demand and supply of a commodity.


The study of two variables excluding the effect of some other variables
is called partial correlation. For example, we study price and demand,
eliminating the supply side.


3. Linear and Non-Linear Correlation 


Linear correlation is a measure of the degree to which two
variables vary together, or a measure of the intensity of the association
between two variables.


If the ratio of change between two variables is uniform, then
there will be linear correlation between them. Consider the following.

X    6    12    18    24
Y    5    10    15    20

The ratio of change between the variables is the same: every increase of
6 in X is accompanied by an increase of 5 in Y.
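The constant-ratio condition can be checked numerically. The short Python sketch below is only an illustration using the four pairs from the table; the variable names are chosen here and are not part of the original text.

```python
# Data from the table above.
x = [6, 12, 18, 24]
y = [5, 10, 15, 20]

# Ratio of each successive change in Y to the matching change in X.
ratios = [(y[i + 1] - y[i]) / (x[i + 1] - x[i]) for i in range(len(x) - 1)]

# The relationship is exactly linear when every ratio is the same.
is_linear = all(abs(r - ratios[0]) < 1e-12 for r in ratios)

print(ratios)
print(is_linear)
```

For real data the ratios will rarely be exactly equal, which is why a coefficient of correlation is used to measure how close the relationship is to a straight line.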


In a curvilinear or non-linear correlation, the amount of change in
one variable does not bear a constant ratio to the amount of change in the
other variable. The graph of a non-linear or curvilinear relationship will
form a curve.


In majority of cases, we find curvilinear relationship, which is a 
complicated one, so we generally assume that the relationship between 
the variables under the study is linear. In social sciences, linear 
correlation is rare, because the exactness is not as perfect as in natural 
sciences. 


CHECK YOUR PROGRESS- 1 


1. What is correlation?
2. Define linear correlation.
3. List the different types of correlation.


9.5 SCATTER DIAGRAM 


A scatter diagram is a simple and attractive method of diagrammatic
representation. In this method, the given data are plotted on a graph
sheet in the form of dots. The values of the x variable are plotted on the
horizontal axis and those of the y variable on the vertical axis. The
scatter or concentration of the plotted points then shows the type of
correlation.


[Figure: scatter diagrams showing perfect positive correlation (r = +1),
perfect negative correlation, high degree of positive correlation, high
degree of negative correlation, low degree of positive correlation, low
degree of negative correlation, and no correlation.]


9.6 TWO-WAY TABLE 


A two-way table (also called a contingency table) is a useful tool for
examining relationships between categorical variables; the entries in the
cells of a two-way table can be frequency counts or relative frequencies
(just like a one-way table).


         Dance    Sports    TV    Total
Men      2        10        8     20
Women    16       6         8     30
Total    18       16        16    50


The two-way table above shows the favourite leisure activities of
50 adults: 20 men and 30 women. Because the entries in the table are
frequency counts, the table is a frequency table.
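The marginal totals and the relative frequencies mentioned above can be obtained from the cell counts. The following Python sketch is one possible way to do this; the dictionary layout and names are assumptions made for illustration.

```python
# Cell counts from the two-way table above.
table = {
    "Men": {"Dance": 2, "Sports": 10, "TV": 8},
    "Women": {"Dance": 16, "Sports": 6, "TV": 8},
}
activities = ["Dance", "Sports", "TV"]

# Row totals (one per group) and column totals (one per activity).
row_totals = {group: sum(cells.values()) for group, cells in table.items()}
col_totals = {a: sum(table[g][a] for g in table) for a in activities}
grand_total = sum(row_totals.values())

# Relative frequencies: each cell count divided by the grand total.
relative = {
    g: {a: table[g][a] / grand_total for a in activities} for g in table
}

print(row_totals, col_totals, grand_total)
print(relative["Women"]["Dance"])
```

Dividing every cell by the grand total turns the frequency table into a relative frequency table; the relative frequencies over all cells add up to 1.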


9.7 PEARSON’S CO-EFFICIENT OF CORRELATION 


Karl Pearson (1857-1936), the British biometrician, suggested this
method. It is popularly known as Pearson's co-efficient of correlation. It
is a mathematical method for measuring the magnitude of the linear
relationship between two variables.


Pearson's correlation coefficient is the test statistic that measures the
statistical relationship, or association, between two continuous variables.
It is widely regarded as the best method of measuring the association
between variables of interest because it is based on the method of
covariance.

(a) Arithmetic Mean Method

\[ r = \frac{\sum xy}{\sqrt{\sum x^2 \, \sum y^2}} \]

where x = X - X̄ and y = Y - Ȳ are the deviations of the observations
from their arithmetic means.


Example:


Find Pearson's co-efficient of correlation from the following data.

Sales  | 15 | 18 | 22 | 28 | 32 | 46  | 52
Profit | 52 | 66 | 78 | 87 | 96 | 125 | 141


Solution

Let the sales be denoted by X and the profit by Y.


Computation of the coefficient of correlation

X      x = X - X̄    x²         Y      y = Y - Ȳ    y²         xy
15     -15.43       238.08     52     -40.14       1611.22    619.36
18     -12.43       154.50     66     -26.14       683.30     324.92
22     -8.43        71.06      78     -14.14       199.94     119.20
28     -2.43        5.90       87     -5.14        26.42      12.49
32     1.57         2.46       96     3.86         14.90      6.06
46     15.57        242.42     125    32.86        1079.78    511.63
52     21.57        465.26     141    48.86        2387.30    1053.91
ΣX=213 Σx=-0.01     Σx²=1179.68 ΣY=645 Σy=0.02     Σy²=6002.86 Σxy=2647.57

X̄ = ΣX/N = 213/7 = 30.43
Ȳ = ΣY/N = 645/7 = 92.14
Σx² = 1179.68, Σy² = 6002.86, Σxy = 2647.57

\[ r = \frac{\sum xy}{\sqrt{\sum x^2 \, \sum y^2}}
     = \frac{2647.57}{\sqrt{1179.68 \times 6002.86}}
     = \frac{2647.57}{34.35 \times 77.48}
     = \frac{2647.57}{2661.44} \approx 0.995 \]

Therefore, there is a high degree of positive correlation between X and Y.
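The hand computation above can be cross-checked with a few lines of Python. This is a minimal sketch of the arithmetic mean (deviation) method applied to the seven sales and profit values of the example; the function name `pearson_r` is chosen here for illustration.

```python
from math import sqrt

# Sales (X) and profit (Y) from the worked example.
sales = [15, 18, 22, 28, 32, 46, 52]
profit = [52, 66, 78, 87, 96, 125, 141]


def pearson_r(xs, ys):
    """Pearson's r computed from deviations about the arithmetic means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    dx = [v - mean_x for v in xs]  # x = X - mean of X
    dy = [v - mean_y for v in ys]  # y = Y - mean of Y
    sum_xy = sum(a * b for a, b in zip(dx, dy))
    sum_x2 = sum(a * a for a in dx)
    sum_y2 = sum(b * b for b in dy)
    return sum_xy / sqrt(sum_x2 * sum_y2)


r = pearson_r(sales, profit)
print(round(r, 3))
```

Because the program does not round the means to two decimals, its intermediate sums differ very slightly from the table, but the coefficient agrees with the hand result to three decimal places.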


9.8 SPEARMAN'S RANK CORRELATION CO-EFFICIENT


In statistics, Spearman's rank correlation coefficient or Spearman's rho,
named after Charles Spearman and often denoted by the Greek letter
ρ (rho) or as r_s, is a nonparametric measure of rank correlation
(statistical dependence between the rankings of two variables). It
assesses how well the relationship between two variables can be
described using a monotonic function.


The Spearman correlation between two variables is equal to the 
Pearson correlation between the rank values of those two variables; while 
Pearson's correlation assesses linear relationships, Spearman's correlation 
assesses monotonic relationships (whether linear or not). If there are no 
repeated data values, a perfect Spearman correlation of +1 or —1 occurs 
when each of the variables is a perfect monotone function of the other. 


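\[ r_s = 1 - \frac{6 \sum d^2}{n(n^2 - 1)} \]

where d is the difference between the two ranks of each observation and
n is the number of observations.


Spearman's coefficient is appropriate for both continuous and discrete
ordinal variables. It can be formulated as a special case of a more
general correlation coefficient.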


Example:


Two faculty members ranked 10 candidates for scholarships. Calculate
the Spearman rank correlation coefficient.

Candidate   | 1 | 2  | 3  | 4 | 5 | 6  | 7 | 8  | 9  | 10
Professor A | 8 | 12 | 6  | 4 | 9 | 15 | 8 | 7  | 16 | 13
Professor B | 9 | 16 | 10 | 8 | 5 | 10 | 7 | 11 | 15 | 18


Solution

Rx     Ry     d = Rx - Ry    d²
8      9      -1             1
12     16     -4             16
6      10     -4             16
4      8      -4             16
9      5      4              16
15     10     5              25
8      7      1              1
7      11     -4             16
16     15     1              1
13     18     -5             25
                             Σd² = 133

\[ r_s = 1 - \frac{6 \sum d^2}{n(n^2 - 1)}
       = 1 - \frac{6(133)}{10(100 - 1)}
       = 1 - \frac{798}{990}
       = 1 - 0.806 \]

r_s = 0.194
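The same calculation can be sketched in Python. The block below applies the rank-difference formula directly to the Rx and Ry columns of the solution table, exactly as the hand computation does; the function name is an assumption made for illustration, and the formula presumes the values are ranks without ties.

```python
# Rx and Ry columns from the solution table.
rank_a = [8, 12, 6, 4, 9, 15, 8, 7, 16, 13]
rank_b = [9, 16, 10, 8, 5, 10, 7, 11, 15, 18]


def spearman_from_ranks(rx, ry):
    """Spearman's r_s = 1 - 6 * sum(d^2) / (n * (n^2 - 1))."""
    n = len(rx)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))


sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
r_s = spearman_from_ranks(rank_a, rank_b)

print(sum_d2)
print(round(r_s, 3))
```

When the data are raw scores rather than ranks, they would first have to be converted to ranks before this formula is applied.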


CHECK YOUR PROGRESS - 2 


4. What are the uses of a scatter diagram?
5. Write the formula to calculate Spearman's rank correlation.


9.9 PROPERTIES OF CORRELATION CO-EFFICIENT 


1. The coefficient of correlation lies between -1 and +1: The
coefficient of correlation cannot take a value less than -1 or more
than +1. Symbolically, -1 ≤ r ≤ +1 or |r| ≤ 1.

2. The coefficient of correlation is independent of change of origin:
This property reveals that if we subtract any constant from all the
values of X and Y, it will not affect the coefficient of correlation.

3. The coefficient of correlation possesses the property of symmetry:
The degree of relationship between two variables is symmetric, that
is, the correlation between X and Y is the same as that between Y
and X.

4. The coefficient of correlation is independent of change of scale:
This property reveals that if we divide or multiply all the values
of X and Y by a (positive) constant, it will not affect the coefficient
of correlation.

5. The value of the coefficient of correlation always lies between
+1 and -1.

When r = +1, there is perfect positive correlation between
the variables.

When r = -1, there is perfect negative correlation between
the variables.

When r = 0, there is no linear relationship between the variables.
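The properties of change of origin, change of scale and symmetry can be illustrated numerically. The Python sketch below reuses the sales and profit figures of the Pearson example; the particular constants subtracted and multiplied are arbitrary choices made for illustration.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson's r from deviations about the arithmetic means."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    dx = [v - mean_x for v in xs]
    dy = [v - mean_y for v in ys]
    sum_xy = sum(a * b for a, b in zip(dx, dy))
    return sum_xy / sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))


x = [15, 18, 22, 28, 32, 46, 52]
y = [52, 66, 78, 87, 96, 125, 141]

r_original = pearson_r(x, y)
# Change of origin: subtract a constant from every value.
r_shifted = pearson_r([v - 30 for v in x], [v - 90 for v in y])
# Change of scale: divide or multiply by positive constants.
r_scaled = pearson_r([v / 2 for v in x], [v * 10 for v in y])
# Symmetry: swap the roles of the two variables.
r_swapped = pearson_r(y, x)

print(r_original, r_shifted, r_scaled, r_swapped)
```

All four values agree apart from floating-point rounding. Note that multiplying only one of the variables by a negative constant would reverse the sign of r, which is why positive constants are used here.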


The formula given above, that is

\[ r = \frac{\sum xy}{\sqrt{\sum x^2 \, \sum y^2}} \]

is easy to calculate, because it is not necessary to calculate the
standard deviations of the X and Y series separately.


9.10 SUMMARY 


• The term correlation refers to the degree of relationship
between two or more variables.

• Scatter diagram is a graphic device for finding correlation
between two variables.

• Karl Pearson correlation coefficient:
\[ r(x, y) = \frac{\sum xy}{\sqrt{\sum x^2 \, \sum y^2}} \]

• Correlation coefficient r lies between -1 and 1, i.e. -1 ≤ r ≤ 1.
When r = +1, the correlation is perfect positive.
When r = -1, the correlation is perfect negative.
When r = 0, there is no linear relationship between the
variables, i.e. the variables are uncorrelated.

• Spearman's rank correlation deals with qualitative
characteristics.


9.11 KEY WORDS 


Correlation, Spearman's Rank Correlation, Pearson Correlation,
Correlation Coefficient, Scatter Diagram


9.12 ANSWER TO CHECK YOUR PROGRESS


1. The term correlation refers to the degree of relationship between
two or more variables.

2. Linear correlation is a measure of the degree to which two
variables vary together, or a measure of the intensity of the
association between two variables.

3. Positive and negative correlation; simple, partial and multiple
correlation; linear and non-linear correlation.

4. Scatter diagram is a graphic device for finding correlation
between two variables.

5. \[ r_s = 1 - \frac{6 \sum D^2}{n(n^2 - 1)} \]


9.13 QUESTIONS AND EXERCISE 


SHORT ANSWER QUESTIONS 


1. Calculate the coefficient of correlation from the following data:
ΣX = 50, ΣY = -30, ΣX² = 290, ΣY² = 300, ΣXY = -115, N = 10

2. The following data pertain to the marks in subjects A and B in a
certain examination. Mean marks in A = 39.5, mean marks in B =
47.5, standard deviation of marks in A = 10.8 and standard
deviation of marks in B = 16.8. The coefficient of correlation
between marks in A and marks in B is 0.42. Give the estimate of
marks in B for a candidate who secured 52 marks in A.

3. What is a scatter diagram? Explain it.

LONG ANSWER QUESTIONS

1. A random sample of recent repair jobs was selected and the estimated
cost and actual cost were recorded. Calculate the value of Spearman's
correlation.
Estimated cost | 70 | 68 | 67 | 55 | 60 | 75 | 63 | 60 | 72
Actual cost    | 65 | 65 | 80 | 60 | 68 | 75 | 62 | 60 | 70

2. Distinguish between Karl Pearson's coefficient and Spearman's
correlation coefficient.

3. Explain the types of correlation with examples.

9.14 FURTHER READINGS 

1. Statistics (Theory & Practice) by Dr. B.N. Gupta. Sahitya Bhawan
Publishers and Distributors (P) Ltd., Agra.

2. Statistics for Management by G.C. Beri. Tata McGraw Hill
Publishing Company Ltd., New Delhi.

3. Business Statistics by Amir D. Aczel and J. Sounderpandian. Tata
McGraw Hill Publishing Company Ltd., New Delhi.

4. Statistics for Business and Economics by R.P. Hooda. MacMillan
India Ltd., New Delhi.

5. Business Statistics by S.P. Gupta and M.P. Gupta. Sultan Chand
and Sons, New Delhi.


UNIT X - SPEARMAN'S RANK
CORRELATION


Structure 


10.0 Introduction 
10.1 Objectives 
10.2 Regression 
10.3 Linear Regression 
10.4 Types of Regression Equations
10.4.1 Regression Equation of X on Y
10.4.2 Regression Equation of Y on X
10.5 Curve Fitting by the Method of Least Squares
10.6 Derivations of Regression Equation
10.7 Properties of Regression Equation
10.8 Summary
10.9 Key Words
10.10 Answer to Check Your Progress
10.11 Questions and Exercise
10.12 Further Readings 


10.0 INTRODUCTION 


Regression means stepping back or going back. The term was first used
by Francis Galton in 1877. He studied the relationship between the
heights of fathers and their sons. The study revealed that


• Tall fathers have tall sons and short fathers have short sons.

• The mean height of the sons of tall fathers is less than the mean
height of their fathers.

• The mean height of the sons of short fathers is more than the mean
height of their fathers.


Galton called this tendency to go back towards the average
'regression'. The line describing the average relationship
between two variables is known as the line of regression.


In statistical modelling, regression analysis is a set of statistical
processes for estimating the relationships among variables. It
includes many techniques for modelling and analyzing several variables,
when the focus is on the relationship between a dependent variable and
one or more independent variables (or 'predictors'). More specifically,
regression analysis helps one understand how the typical value of
the dependent variable (or 'criterion variable') changes when any
one of the independent variables is varied, while the other independent
variables are held fixed.


Regression analysis is widely used for prediction and forecasting,
where its use has substantial overlap with the field of machine
learning.


10.1 OBJECTIVES 


After studying this chapter, students will be able to understand
• The concept of regression and regression coefficients
• Types of regression equations
• Regression lines, both x on y and y on x


10.2 REGRESSION 


Regression analysis refers to assessing the relationship between the
outcome variable and one or more other variables. The outcome variable is
known as the dependent or response variable, and the risk elements and
confounders are known as predictors or independent variables. The
dependent variable is denoted by "y" and the independent variables are
denoted by "x" in regression analysis.


10.3 LINEAR REGRESSION 


Linear regression attempts to model the relationship between two 
variables by fitting a linear equation to observed data. One variable is 
considered to be an explanatory variable, and the other is considered to 
be a dependent variable. For example, a modeler might want to relate the 
weights of individuals to their heights using a linear regression model. 


10.4 TYPES OF REGRESSION EQUATIONS 


The regression equation is the algebraic expression of the regression
line. It is used to predict the values of the dependent variable from the
given values of the independent variable. As there are two regression
lines, there are two regression equations. For the two variables X and Y,
they are:


• Regression equation of X on Y.
• Regression equation of Y on X.


10.4.1 Regression Equation of X on Y

The straight line equation is X = a + bY.


Here a and b are unknown constants which determine the position of
the line. The constant a is the intercept and the constant b is the
slope. The following two normal equations are derived:

ΣX = na + bΣY
ΣXY = aΣY + bΣY²


The regression equation of X on Y is used to find the value of X for a
given value of Y.


10.4.2 Regression Equation of Y on X

The straight line equation is Y = a + bX.


The following two normal equations are derived:

ΣY = na + bΣX
ΣXY = aΣX + bΣX²


The regression equation of Y on X is used to ascertain the value of Y for
a given value of X.

Example:
Find the regression equations of x on y and of y on x from the following
data:
x | 15 | 20 | 25 | 30 | 35 | 40 | 45
y | 8  | 14 | 20 | 26 | 32 | 38 | 44

Solution

x         y         x²          y²          xy
15        8         225         64          120
20        14        400         196         280
25        20        625         400         500
30        26        900         676         780
35        32        1225        1024        1120
40        38        1600        1444        1520
45        44        2025        1936        1980
Σx=210    Σy=182    Σx²=7000    Σy²=5740    Σxy=6300


Regression equation of x on y: x = a + by

The normal equations are
Σx = na + bΣy
Σxy = aΣy + bΣy²

Hence
210 = 7a + 182b          (1)
6300 = 182a + 5740b      (2)

Multiplying equation (1) by 26:
5460 = 182a + 4732b      (3)

Deducting equation (3) from equation (2):
6300 = 182a + 5740b      (2)
5460 = 182a + 4732b      (3)
840 = 0 + 1008b
Therefore, b = 840/1008 = 0.83


Substituting the value of b in Eq. (1):
210 = 7a + (182 × 0.83)
210 = 7a + 151.06
7a = 210 - 151.06
7a = 58.94
a = 8.42
Hence, x = a + by
x = 8.42 + 0.83y


Regression equation of y on x: y = a + bx

The normal equations are
Σy = na + bΣx
Σxy = aΣx + bΣx²

Hence
182 = 7a + 210b          (1)
6300 = 210a + 7000b      (2)

Multiplying Eq. (1) by 30:
5460 = 210a + 6300b      (3)

Deducting Eq. (3) from Eq. (2):
6300 = 210a + 7000b      (2)
5460 = 210a + 6300b      (3)
840 = 0 + 700b
b = 840/700
  = 1.2


Substituting the value of b in Eq. (1):
182 = 7a + (210 × 1.2)
182 = 7a + 252
7a = 182 - 252
7a = -70
a = -10

Therefore,
y = -10 + 1.2x
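The two pairs of normal equations can also be solved programmatically. The Python sketch below is one possible way to do so for the data of the example; the helper `fit_line` is a name introduced here for illustration and solves the normal equations in closed form.

```python
def fit_line(pred, resp):
    """Solve the normal equations for the line resp = a + b * pred.

    sum(resp)        = n * a         + b * sum(pred)
    sum(pred * resp) = a * sum(pred) + b * sum(pred^2)
    """
    n = len(pred)
    s_p = sum(pred)
    s_r = sum(resp)
    s_pp = sum(p * p for p in pred)
    s_pr = sum(p * r for p, r in zip(pred, resp))
    b = (n * s_pr - s_p * s_r) / (n * s_pp - s_p ** 2)
    a = (s_r - b * s_p) / n
    return a, b


x = [15, 20, 25, 30, 35, 40, 45]
y = [8, 14, 20, 26, 32, 38, 44]

a_yx, b_yx = fit_line(x, y)  # regression of y on x
a_xy, b_xy = fit_line(y, x)  # regression of x on y

print(round(a_yx, 2), round(b_yx, 2))
print(round(a_xy, 2), round(b_xy, 2))
```

For y on x the program reproduces a = -10 and b = 1.2. For x on y it gives b of about 0.8333 and a of about 8.33; the hand computation arrives at a = 8.42 only because b was rounded to 0.83 before substitution.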


10.5 CURVE FITTING BY THE METHOD OF LEAST SQUARES


1. Curve Fitting


Curve fitting is the process of introducing mathematical
relationships between dependent and independent variables in the form of
an equation for a given set of data.


2. Method of Least Squares


The method of least squares helps us to find the values of the
unknowns a and b in such a way that the following two conditions are
satisfied:


• The sum of the residuals (deviations) of the observed values of Y
from the corresponding expected (estimated) values Ŷ will be zero:
Σ(Y - Ŷ) = 0.

• The sum of the squares of the residuals (deviations) of the observed
values of Y from the corresponding expected values Ŷ should be the
least: Σ(Y - Ŷ)² is a minimum.


3. Fitting of a Straight Line:


A straight line can be fitted to the given data by the method of
least squares. The equation of a straight line or least square line
is Y = a + bX, where a and b are constants or unknowns.


To compute the values of these constants we need as many equations as
the number of constants in the equation. These equations are called
normal equations. In a straight line there are two constants, a and b, so
we require two normal equations.


Normal equation for 'a': ΣY = na + bΣX

Normal equation for 'b': ΣXY = aΣX + bΣX²

Example:


The following example explains how to find the equation of a straight
line, or least square line, by using the method of least squares, which is
very useful in statistics as well as in mathematics.

X | 1 | 2 | 3 | 4 | 5
Y | 2 | 5 | 3 | 8 | 7


Solution

X        Y        XY        X²        Ŷ = 1.1 + 1.3X    Y - Ŷ
1        2        2         1         2.4               -0.4
2        5        10        4         3.7               1.3
3        3        9         9         5.0               -2.0
4        8        32        16        6.3               1.7
5        7        35        25        7.6               -0.6
ΣX=15    ΣY=25    ΣXY=88    ΣX²=55    Trend values      Σ(Y - Ŷ) = 0


The equation of the least square line is Y = a + bX.

Normal equation for 'a': ΣY = na + bΣX
25 = 5a + 15b          (1)

Normal equation for 'b': ΣXY = aΣX + bΣX²
88 = 15a + 55b         (2)


To eliminate a, multiply equation (1) by 3 and subtract it from
equation (2):
75 = 15a + 45b         (3)
13 = 10b, so b = 1.3

Substituting b = 1.3 in equation (1): 25 = 5a + 19.5, so a = 1.1.

Here a = 1.1 and b = 1.3, so the least square line is Y = 1.1 + 1.3X.

For the trend values, put the values of X in the above equation.

10.6 DERIVATIONS OF REGRESSION EQUATION 


1. When Deviations are Taken from Arithmetic Means of X and Y


The above method of finding the regression equation is laborious.
Instead, we can use the deviations of the X and Y observations
from their respective averages.


(i) Regression equation of X on Y


The regression equation of X on Y can also be expressed in the following
form:

\[ X - \bar{X} = r \frac{\sigma_x}{\sigma_y} (Y - \bar{Y}) \]

Here, X̄ is the average of the X observations and Ȳ is the average of
the Y observations.


r σx/σy is the regression coefficient of X on Y and is denoted by bxy. bxy
measures the amount of change in X corresponding to a unit change in Y.

\[ r \frac{\sigma_x}{\sigma_y} = b_{xy} = \frac{\sum xy}{\sum y^2} \]

where x = (X - X̄) and y = (Y - Ȳ).

(ii) Regression equation of Y on X


The regression equation of Y on X can also be expressed in the following
form:

\[ Y - \bar{Y} = r \frac{\sigma_y}{\sigma_x} (X - \bar{X}) \]

r σy/σx is the regression coefficient of Y on X and is denoted by byx. byx
measures the amount of change in Y corresponding to a unit change in X.

\[ r \frac{\sigma_y}{\sigma_x} = b_{yx} = \frac{\sum xy}{\sum x^2} \]

We can calculate the coefficient of correlation, which is the geometric
mean of the two regression coefficients (bxy and byx), i.e.

\[ r = \pm\sqrt{b_{xy} \times b_{yx}} \]

where r takes the same sign as the regression coefficients.

2. When Deviations are Taken from Assumed Mean


Instead of using the actual means of the X and Y observations, we may
use any arbitrary item (in the observations) as the mean.
We then take the deviations of the X and Y values from their respective
assumed means.


The formula for calculating the regression coefficient when the
regression is of Y on X is as follows:

\[ b_{yx} = r \frac{\sigma_y}{\sigma_x}
   = \frac{\sum dx\,dy - \frac{(\sum dx)(\sum dy)}{N}}
          {\sum dx^2 - \frac{(\sum dx)^2}{N}} \]

where dx = (X - Ax) {Ax = assumed mean of the X observations} and dy =
(Y - Ay) {Ay = assumed mean of the Y observations}.


The formula for calculating the regression coefficient when the
regression is of X on Y is as follows:

\[ b_{xy} = r \frac{\sigma_x}{\sigma_y}
   = \frac{\sum dx\,dy - \frac{(\sum dx)(\sum dy)}{N}}
          {\sum dy^2 - \frac{(\sum dy)^2}{N}} \]

In the case of a grouped frequency distribution, the regression
coefficients are calculated from the bivariate frequency table (or
correlation table).


The formula for calculating the regression coefficient (in the case of a
grouped frequency distribution) when the regression is of Y on X is as
follows:

\[ b_{yx} = r \frac{\sigma_y}{\sigma_x}
   = \frac{\sum f\,dx\,dy - \frac{(\sum f\,dx)(\sum f\,dy)}{N}}
          {\sum f\,dx^2 - \frac{(\sum f\,dx)^2}{N}} \times \frac{h_y}{h_x} \]

where hx = class interval of the X variable and hy = class interval of
the Y variable.


The formula for calculating the regression coefficient (in the case of a
grouped frequency distribution) when the regression is of X on Y is as
follows:

\[ b_{xy} = r \frac{\sigma_x}{\sigma_y}
   = \frac{\sum f\,dx\,dy - \frac{(\sum f\,dx)(\sum f\,dy)}{N}}
          {\sum f\,dy^2 - \frac{(\sum f\,dy)^2}{N}} \times \frac{h_x}{h_y} \]
CHECK YOUR PROGRESS - 1


1. What are regression coefficients?
2. What is the formula used to calculate the regression coefficients
from an assumed mean?
3. What is the method of least squares?


10.7 PROPERTIES OF REGRESSION EQUATION

The constant 'b' in the regression equation (Y = a + bX) is called
the regression coefficient. It determines the slope of the line, i.e. the
change in the value of Y corresponding to a unit change in X, and
therefore it is also called the "slope coefficient."


1. The correlation coefficient is the geometric mean of the two
regression coefficients. Symbolically, it can be expressed as:

\[ r = \pm\sqrt{b_{xy} \times b_{yx}} \]

2. The value of the coefficient of correlation cannot exceed unity,
i.e. 1. Therefore, if one of the regression coefficients is greater
than unity, the other must be less than unity.

3. The sign of both the regression coefficients will be the same, i.e.
they will be either positive or negative. Thus, it is not possible that
one regression coefficient is negative while the other is positive.

4. The coefficient of correlation will have the same sign as that of
the regression coefficients; if the regression coefficients
have a positive sign, then r will be positive, and vice versa.

5. The average value of the two regression coefficients will be
greater than or equal to the value of the correlation. Symbolically, it
can be represented as

\[ \frac{b_{xy} + b_{yx}}{2} \geq r \]

6. The regression coefficients are independent of the change of
origin, but not of the scale. By origin, we mean that there will be
no effect on the regression coefficients if any constant is
subtracted from the values of X and Y. By scale, we mean that if
the values of X and Y are either multiplied or divided by some
constant, then the regression coefficients will also change.


Thus, all these properties should be kept in mind while solving
for the regression coefficients.

10.8 SUMMARY 


• Linear regression attempts to model the relationship between two
variables by fitting a linear equation to observed data.

• Regression analysis refers to assessing the relationship between
the outcome variable and one or more variables.

• A straight line can be fitted to the given data by the method of
least squares.

• The constant 'b' in the regression equation (Y = a + bX) is called
the regression coefficient. It determines the slope of the line,
i.e. the change in the value of Y corresponding to a unit change
in X, and therefore it is also called the "slope coefficient."


10.9 KEY WORDS


Regression, Linear Regression, Types of Regression Coefficient,
Properties of Regression Coefficient, Straight Line, Regression Equation,
Deviations, Actual Mean


10.10 ANSWER TO CHECK YOUR PROGRESS


1. Regression analysis refers to assessing the relationship between
the outcome variable and one or more variables.


2. When the regression is of Y on X:

\[ b_{yx} = r \frac{\sigma_y}{\sigma_x}
   = \frac{\sum dx\,dy - \frac{(\sum dx)(\sum dy)}{N}}
          {\sum dx^2 - \frac{(\sum dx)^2}{N}} \]

When the regression is of X on Y:

\[ b_{xy} = r \frac{\sigma_x}{\sigma_y}
   = \frac{\sum dx\,dy - \frac{(\sum dx)(\sum dy)}{N}}
          {\sum dy^2 - \frac{(\sum dy)^2}{N}} \]


3. The method of least squares helps us to find the values of the
unknowns a and b in such a way that the following two conditions
are satisfied:

• Σ(Y - Ŷ) = 0.
• Σ(Y - Ŷ)² is a minimum.


10.11 QUESTIONS AND EXERCISE


SHORT ANSWER QUESTIONS


1. What are regression coefficients?

2. Define regression and write down the two regression equations.

3. Describe the different types of regression.

4. What are the uses of regression analysis?


LONG ANSWER QUESTIONS


1. Explain the principle of least squares.

2. State the properties of regression equations.

3. For 5 observations of pairs (X, Y) of the variables X and Y, the
following results are obtained: ΣX = 15, ΣY = 25, ΣX² = 55,
ΣY² = 135, ΣXY = 83. Find the equations of the lines of regression
and estimate the value of X when Y = 8 and the value of Y when
X = 12.

4. Using the following information, you are requested to (i) obtain
the linear regression of Y on X and (ii) estimate the level of
defective parts delivered when inspection expenditure amounts to
Rs. 28,000: ΣX = 424, ΣY = 363, ΣX² = 21926, ΣY² = 15123,
ΣXY = 12815, N = 10. Here X is the expenditure on inspection and Y
is the defective parts delivered.


UNIT XI - BUSINESS FORECASTING


Structure 

11.1 Introduction 

11.2 The Objectives of Forecasting 

11.3 Prediction, projection and forecasting 

11.4 Characteristics of Forecasting
11.5 Steps in Forecasting 

11.6 Methods of Business Forecasting 


11.1 Introduction 


Business forecasting is a method to predict the future, where the future is 
narrowly defined by economic conditions. It combines information 
gathered from past circumstances with an accurate picture of the present 
economy to predict future conditions for a business. 


It refers to techniques such as taking a prospective view of how the 
economy is likely to turn out in the short-term. Its use is critical for 
businesses whenever the future is uncertain. The more they can focus on 
the probable outcome, the more success the organization has as it moves 
forward. 


11.2 The Objectives of Forecasting 


In the narrow sense, the objective of forecasting is to produce better 
forecasts. But in the broader sense, the objective is to improve 
organizational performance—more revenue, more profit, increased 
customer satisfaction. Better forecasts, by themselves, are of no inherent 
value if those forecasts are ignored by management or otherwise not used 
to improve organizational performance. 


A wonderfully sinister way to improve forecast accuracy (while ignoring 
more important things like order fill, customer satisfaction, revenue 
generation, and profit) was provided by Ruud Teunter of Lancaster 
University, at the 2008 International Symposium on Forecasting. Teunter 
compared various forecasting methods for a data set of 5,000 items having 
intermittent demand patterns. (Intermittent patterns have zero demand in 
many or most time periods.) 


Teunter found that if the goal is simply to minimize forecast error, then 
forecasting zero in every period was the best method to use! (The zero 
forecast had lower error than a moving average, exponential smoothing, 
bootstrapping, and three variations of Croston’s method that were tested.) 
However, for proper inventory management to serve customer needs, 
forecasting zero demand every period is probably not the right thing to do. 
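
Teunter's observation can be illustrated with a small sketch. The demand figures below are invented for illustration (they are not Teunter's data), and mean absolute error (MAE) is assumed as the error measure.

```python
# Hypothetical intermittent demand: zero in most periods (illustrative data only).
demand = [0, 0, 0, 5, 0, 0, 0, 0, 3, 0]


def mean_absolute_error(actuals, forecast):
    """Average absolute gap between each actual value and a constant forecast."""
    return sum(abs(a - forecast) for a in actuals) / len(actuals)


average_demand = sum(demand) / len(demand)

mae_zero = mean_absolute_error(demand, 0)
mae_average = mean_absolute_error(demand, average_demand)

print("MAE when forecasting zero every period:", round(mae_zero, 2))
print("MAE when forecasting the average demand:", round(mae_average, 2))
```

On this made-up series the zero forecast scores the lower error, even though a firm that stocked nothing would fail every customer who did place an order.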


A similar point was made last fall in a Foresight article by Stephan Kolassa 
and Roland Martin (discussed in "Tumbling Dice"). Using a simple dice 
tossing experiment, they showed the implications for bias in commonly 
used percentage error metrics. What makes this important to management 
is that if the sole incentive for forecasters is to minimize MAPE, the 
forecaster could do best by purposely forecasting too low. This, of course,
could have bad consequences for inventory management and customer
service. 
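
The bias in percentage error metrics can be seen with a single fair die. The sketch below is only in the spirit of the dice experiment mentioned above, not a reproduction of it; the function name `mape` and the list of candidate forecasts are assumptions made for illustration.

```python
# Outcomes of one fair die; each face is equally likely.
faces = [1, 2, 3, 4, 5, 6]


def mape(forecast):
    """Mean absolute percentage error of a constant forecast over all faces."""
    return sum(abs(a - forecast) / a for a in faces) / len(faces) * 100


expected_value = sum(faces) / len(faces)  # the unbiased forecast of a die roll

# Search the candidate constant forecasts for the one with the lowest MAPE.
candidates = faces + [expected_value]
best = min(candidates, key=mape)

for c in sorted(candidates):
    print(f"forecast {c}: MAPE = {mape(c):.1f}%")
print("MAPE-minimising forecast:", best)
```

The forecast with the lowest MAPE lies well below the expected value of the roll, which is why a forecaster rewarded only on MAPE is tempted to forecast too low.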


11.3 Prediction, projection and forecasting 


Forecast is scientific and free from intuition and personal bias, whereas 
prediction is subjective and fatalistic in nature. 

Forecasting is an extrapolation of the past into the future, while prediction is
judgmental and takes into account changes taking place in the future.
Therefore, prediction is utilized more in business and economics while 
forecasting takes place in weather and earthquakes. 

Predicting is saying or telling something before the event while forecasting 
is done on the basis of analysis of the past. 

Forecasting is still not a complete science as there are chances of error.


Concept of Forecasting

Forecasting is a process of making predictions about the future course of a 
business or a company based on trend analysis and past and present data. 


So essentially data is collected and studied about the business, and analysis 
is done to forecast future scenarios that are likely to occur. Hence 
forecasting is an important tool in the process of business planning. 


Forecasts are usually made by managers at different levels, statisticians,
experts, economists, consultants, etc. They involve a lot of data collection
(both past and present) and data analysis.

There is also the use of scientific techniques and methods. But at the end of 
the day, forecasting is not an exact science. There is always some guessing 
and observations involved. This is when the experience and knowledge of 
these experts come into play. 


11.4 Characteristics of Forecasting


The characteristics of forecasting are as follows:


Forecasting is strictly concerned with future events only.

It analyses the probability of a future event or transaction occurring.

It involves analysis of data from the past and the present.

Forecasting uses scientific techniques and methods to make such forecasts.

But it also involves a certain amount of guesswork and observation.


11.5 Steps in Forecasting 


Identifying and Understanding the Structure 

An almost unlimited number of factors can affect the future of a business.
Identifying all these factors is neither possible nor desirable. So to make an
accurate forecast, the managers have to identify the factors on which to
focus. Internal and external factors must therefore be studied to identify the
strategic factors of the business.


Forecasting the future 

Now that the foundation is laid the next step is to make an accurate and 
scientific forecast. This involves both scientific tools and techniques and 
also professional judgment and observations. The forecast is not a 
foolproof plan, only a guiding map for the future. 


Analysis of Deviations

No forecast will be completely accurate. The differences or deviations
from the forecasts should be analyzed and studied. This will help in 
making more accurate forecasts in the future. 


Adapting the Forecasting Procedure

In forecasting, the skill and professional judgement required are gained by
experience and practice. The forecasting process is fine-tuned with every
cycle. So we can learn from our mistakes and shortcomings and keep
improving the forecasting procedures.


11.6 Methods of Business Forecasting 


The following are six methods of business forecasting: 1. Bottom-up Method
2. Top-down Method 3. Historical Method 4. Deductive Method 5. Joint Opinion
Method 6. Scientific Business Forecasting.


Business Forecasting: Method # 1. 
Bottom-up Method: 


Under this method various departments of an enterprise collect their own 
information/data and prepare their own forecasts. On the basis of these 
forecasts, the forecast for the firm as a whole is then undertaken. Thus, the 
responsibility of successful forecasting lies directly with various 
departments and people in the organisation. 


Business Forecasting: Method # 2. 
Top-down Method: 


This method is just the reverse of the direct or bottom-up method. In this
method the forecast for the industry/business as a whole is ascertained first 
and then the particular forecasts for the various activities of the business 
are established. The process of forecasting is, thus, indirect and the 
responsibility for success in forecasting mainly lies with the top levels of 
management. 
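
The difference between the bottom-up and top-down methods can be sketched in a few lines. All department names, forecasts and allocation shares below are hypothetical figures chosen only for illustration.

```python
# Hypothetical departmental sales forecasts (units) for next year.
department_forecasts = {"North": 1200, "South": 800, "East": 500}

# Bottom-up: the firm-level forecast is the sum of the departmental forecasts.
bottom_up_total = sum(department_forecasts.values())

# Top-down: start from a firm-level forecast and allocate it downward.
# The total and the shares below are assumed figures.
top_down_total = 3000
shares = {"North": 0.5, "South": 0.3, "East": 0.2}
top_down_forecasts = {dept: top_down_total * share for dept, share in shares.items()}

print("Bottom-up firm forecast:", bottom_up_total)
print("Top-down departmental forecasts:", top_down_forecasts)
```

In the first case the departments own the numbers; in the second the top-level figure is fixed first and the departments inherit their share of it.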


Business Forecasting: Method # 3. 
Historical Method: 


This method refers to the projection of trends on the basis of past events. 
The historical sequence of events is analysed as a basis for understanding 
the present situation and forecasting the future trends. The past recurring 
trends are associated with the corresponding cause and effect phenomenon 
in the future. 


The important advantages of this method include:

(i) The past information or records can be easily obtained; and

(ii) Present information is also not ignored.

However, the main limitation of this method is that the future trends may
deviate drastically from the normal path indicated by the past events.
Further, it may not be possible to find a trend or develop a correlation between
cyclic movements of past data and other variables which have a bearing
upon them.

Business Forecasting: Method # 4. 

Deductive Method: 


Under this method future trends are based on observation and
investigation. In addition to critical analysis of past events to draw
inferences about the future, the method relies on the subjective evaluation,
discretion, experience and intuition of the forecaster.


This method can be regarded as more dynamic in character as it takes into 
consideration not only the historical sequence of events but also the latest 
developments. However, the main drawback of this method is that it relies 
more on individual judgement and initiative appraisal than on actual 
record. 


Business Forecasting: Method # 5. 

Joint Opinion Method: 

As the name suggests, this method utilises the collective opinion,
judgement and experience of various experts. A committee for business
forecasting is formed to take the joint view of various members. An
attempt is made to evolve a consensus for predicting future events on the
basis of their views.

The main advantages of this method include: 

(i) It encourages co-operation and co-ordination and also utilises the 
services of various experts; 

(ii) There is no need of detailed statistical analysis, and 

(iii) It is simple and easy to operate. 

However, the main disadvantage of this method is the joint responsibility,
which may ultimately result in nobody’s responsibility. The members of
the committee may also not take an active interest as they know that their
judgement may not be finally accepted. This may degenerate the entire
forecasting process into mere guesswork.

Business Forecasting: Method # 6. 

Scientific Business Forecasting: 

Under this method, forecasting is done on scientific lines by making use of 
various statistical tools, such as, business index or barometer, extrapolation 
or mathematical projections, regression and econometric models. Past 
statistical data modified in the light of changed present conditions provides 
the basic raw material for drawing more accurate conclusions for the 
future. 

The following are some of the most important statistical tools used for 
business forecasting: 


(a) Business Index or Barometer.
(b) Extrapolation or Mathematical Projection.
(c) Regression.
(d) Econometric Model.


(a) Business Index or Barometer:


The term ‘business index’ refers to a series relating to business conditions.
It is also known as ‘barometer’, ‘indicator’ or ‘economic forecaster.’ Such 
a business index number may relate to general conditions of business or to 
a particular trade or industry or to an individual business. 


The index number may measure changes in business activity during the
phases of the business cycle, i.e. boom, decline, depression and recovery.
It is called a business barometer because it helps in making forecasts of
future business conditions.


The indices of production, wages, trade, finance, stocks and shares, etc. are
plotted on graph paper to obtain the curve showing the trend of long-period
and seasonal movements. The various index numbers relating to different
activities of business may be combined into a general or composite index
of business activity.

This general index is an indicator of future conditions of trade and industry
in general. However, the behaviour of an individual trade or industry might
show a different trend from that of the general index. As such, the study of
the general index should be supplemented by separate studies of individual
trades or industries.


The following are some of the important series which are considered by 
businessmen for forecasting: 


(i) Index of Wholesale Prices 

(ii) Index of Consumer Prices 

(iii) Index of Industrial Production

(iv) Gross National Product 

(v) Employment 

(vi) General Aggregate Consumption 

(vii) Volume of Agricultural Production

(viii) Stock Exchange Index. 

The different figures may be converted into relatives on a certain base. The 
weighted average of these relatives may be computed to ascertain the 
business index called the barometer. 
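
The conversion into relatives and the weighted average can be sketched as follows. The three component series, their base and current values and the weights are hypothetical; a real barometer would use published series and agreed weights.

```python
# Hypothetical series: (base-period value, current value, weight).
series = {
    "wholesale prices": (200.0, 220.0, 3),
    "industrial production": (150.0, 180.0, 4),
    "employment": (80.0, 84.0, 3),
}


def relative(base, current):
    """Express the current value as a percentage of the base-period value."""
    return current / base * 100


def composite_index(data):
    """Weighted average of the relatives of all component series."""
    total_weight = sum(w for _, _, w in data.values())
    weighted_sum = sum(relative(b, c) * w for b, c, w in data.values())
    return weighted_sum / total_weight


print("Business barometer:", round(composite_index(series), 1))
```

A value above 100 indicates that, on the chosen weights, business activity stands above its base-period level.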

These business barometers guide businessmen in taking decisions on
many problems like expansion of production activity, diversification,
undertaking of a new project, exploring new markets, launching a sales
campaign, raising of funds through issue of shares or debentures, etc.

The reports on general business and trade conditions are published by the
Chamber of Commerce, industry and some trade associations. Important 
journals and newspapers also publish index numbers relating to various 
industries and trades. The Reserve Bank of India also publishes various 
index numbers and indicators of general economic conditions. 


The business barometers are very useful in business forecasting, but
sometimes these barometers give misleading conclusions due to inaccurate
construction of index numbers or changed conditions. As many factors
may prevent history from repeating itself, it is necessary to modify the trend
revealed by business barometers in the light of specific conditions influencing
the judgement.


(b) Extrapolation or Mathematical Projection: 


Extrapolation is the process of estimating a value for some future period, 
based on some assumptions. 


The basic assumptions underlying this statistical tool of business 
forecasting include: 


(i) There should not be sudden jumps in figures from one period to another; 
and 


(ii) The conditions in the future will not change materially. 
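
One simple way to extrapolate, assuming both conditions above hold, is to project the series forward by its average period-to-period change. This is only one of several possible projection rules, and the sales figures are hypothetical.

```python
def extrapolate(history, periods_ahead):
    """Project the series forward by its average period-to-period change."""
    average_change = (history[-1] - history[0]) / (len(history) - 1)
    return history[-1] + periods_ahead * average_change


# Hypothetical annual sales (units); the figures rise steadily without jumps.
sales = [100, 104, 109, 113, 118, 122]

print("Projection one year ahead:", round(extrapolate(sales, 1), 1))
print("Projection three years ahead:", round(extrapolate(sales, 3), 1))
```

If either assumption fails, for example after a sudden jump in the figures, the projected values would no longer be a reasonable guide.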


(c) Regression:


The regression equation, y = a + bx, can be used as an instrument to predict
the value of y for a given value of x. The regression equation is widely used
in the physical sciences where the data are related functionally. However, in
business forecasting it may be very difficult to establish functional
relationships and hence the use of the regression equation is also limited.
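
As a minimal sketch, the constants a and b of the regression equation can be obtained by the method of least squares. The helper name `fit_line` and the advertising and sales figures are assumptions made for illustration.

```python
def fit_line(xs, ys):
    """Return (a, b) of the least-squares line y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    a = mean_y - b * mean_x
    return a, b


# Hypothetical data: advertising spend (x) and sales (y).
spend = [1, 2, 3, 4, 5]
sales = [12, 15, 19, 20, 24]

a, b = fit_line(spend, sales)
print(f"y = {a:.2f} + {b:.2f}x")
print("Predicted y for x = 6:", round(a + b * 6, 2))
```

The prediction for x = 6 lies outside the observed range, so it carries the same caution as any extrapolation.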


(d) Econometric Models: 


Economic activities described in terms of mathematical equations are
referred to as econometric models. These models show the
inter-relationships amongst the various aspects of the economy. The
econometric models are not very popular because it is not possible for every
business to develop its own model of the economy.


Structure


12.1 Introduction 

12.2 Regression analysis 

12.3 Exponential Smoothing Method 

12.4 Theories of Business Forecasting: 
12.5 Theory of Economic Rhythm 

12.6 Action and Reaction Approach 

12.7 Sequence Method or Time Lag Method 
12.8 Specific Historical Analogy 

12.9 Cross-Cut Analysis 

12.10 Model Building Approach 

12.11 Utility of Business Forecasting 
12.12 Limitations of Business Forecasting 
12.13 Business Forecasting: Advantage 


12.1 Introduction 


A series of observations on a variable, recorded after successive intervals
of time, is called a time series. The successive intervals are usually equal
time intervals, e.g., 10 years, a year, a quarter, a month, a week, a
day, an hour, etc. The data on the population of India is a time series
where the time interval between two successive figures is 10 years.
Similarly, figures of national income, agricultural and industrial production,
etc., are available on a yearly basis.


12.2 Regression analysis 


The main objective of regression analysis is to know the nature of 
relationship between two variables and to use it for predicting the most 
likely value of the dependent variable corresponding to a given, known 
value of the independent variable. This can be done by substituting in 
Eq.(5.1a) any known value of X corresponding to which the most likely 
estimate of Y is to be found. 


For example, the estimate of Y (i.e. Yc), corresponding to X = 15 is


Yc = 8.61 + 0.71(15)
   = 8.61 + 10.65
   = 19.26

It may be appreciated that an estimate of Y derived from a regression
equation will not be exactly the same as the Y value which may actually be
observed. The difference between the estimated Yc values and the
corresponding observed Y values will depend on the extent of scatter of
the various points around the line of best fit.

The closer the various paired sample points (Y, X) are clustered around the
line of best fit, the smaller the difference between the estimated Yc and
observed Y values, and vice-versa. On the whole, the lesser the scatter of
the various points around, and the lesser the vertical distance by which
these deviate from, the line of best fit, the more likely it is that an
estimated Yc value is close to the corresponding observed Y value.


12.3 Exponential Smoothing Method

The exponential smoothing method also facilitates continuous
updating of the estimate of the mean absolute deviation (MAD). The
current MAD_t is given by

MAD_t = α |Actual value - Forecasted value| + (1 - α) MAD_{t-1}


Higher values of the smoothing constant α make the current MAD more
responsive to current forecast errors.
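
The updating rule can be sketched as below. The demand and forecast pairs are the first three weeks of Example 7.7; the starting MAD of 20 units and the function name `update_mad` are assumptions, since the text gives no initial value.

```python
def update_mad(previous_mad, actual, forecast, alpha):
    """Smooth the absolute forecast error into the running MAD estimate."""
    return alpha * abs(actual - forecast) + (1 - alpha) * previous_mad


# Assumed starting MAD of 20 units; (actual demand, forecast) pairs follow.
mad = 20.0
for actual, forecast in [(450, 500), (505, 495), (516, 496)]:
    mad = update_mad(mad, actual, forecast, alpha=0.1)
    print(round(mad, 2))
```

With a larger smoothing constant the same errors would move the MAD further in each step.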


Example 7.7: A firm uses simple exponential smoothing with α = 0.1 to
forecast demand. The forecast for the week of February 1 was 500 units
whereas actual demand turned out to be 450 units.


Forecast the demand for the week of February 8.

Assume the actual demand during the week of February 8 turned out
to be 505 units. Forecast the demand for the week of February 15.
Continue forecasting through March 15, assuming that subsequent
demands were actually 516, 488, 467, 554 and 510 units.

Solution: Given F_{t-1} = 500, D_{t-1} = 450 and α = 0.1,

F_t = F_{t-1} + α(D_{t-1} - F_{t-1}) = 500 + 0.1(450 - 500) = 495 units


Forecast of demand for the week of February 15 is shown in Table 7.5.

Table 7.5: Forecast of Demand

Week      Demand    Old Forecast   Forecast Error         Correction              New Forecast F_t =
          D_{t-1}   F_{t-1}        (D_{t-1} - F_{t-1})    α(D_{t-1} - F_{t-1})    F_{t-1} + α(D_{t-1} - F_{t-1})

Feb. 1    450       500            -50                    -5                      495
Feb. 8    505       495             10                     1                      496
Feb. 15   516       496             20                     2                      498
Feb. 22   488       498            -10                    -1                      497
Mar. 1    467       497            -30                    -3                      494
Mar. 8    554       494             60                     6                      500
Mar. 15   510       500             10                     1                      501


If no previous forecast value is known, the old forecast starting point 
may be estimated or taken to be an average of some preceding 
periods. 
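
The calculations of Example 7.7 can be checked with a short sketch. The function name `exponential_smoothing` is an assumption; the demands, the starting forecast of 500 and the smoothing constant of 0.1 are taken from the example.

```python
def exponential_smoothing(demands, initial_forecast, alpha):
    """Return the new forecast produced after each observed demand."""
    forecasts = []
    forecast = initial_forecast
    for demand in demands:
        # New forecast = old forecast + alpha * (demand - old forecast)
        forecast = forecast + alpha * (demand - forecast)
        forecasts.append(forecast)
    return forecasts


demands = [450, 505, 516, 488, 467, 554, 510]  # weeks of Feb. 1 to Mar. 15
new_forecasts = exponential_smoothing(demands, initial_forecast=500, alpha=0.1)
print([round(f, 1) for f in new_forecasts])
```

Each value in the printed list is the forecast for the following week, which corresponds to the New Forecast column of Table 7.5.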


The estimated Yc values will coincide with the observed Y values only when all
the points on the scatter diagram fall in a straight line. If this were so,
the sales for a given marketing expenditure could be estimated with
100 percent accuracy. But such a situation is too rare to obtain. Since some
of the points must lie above and some below the straight line, perfect
prediction is practically non-existent in the case of most business and
economic situations.


This means that the estimated values of one variable based on the known
values of the other variable are always bound to differ from the observed
values. The smaller the difference, the greater the precision of the estimate,
and vice-versa. Accordingly, the preciseness of an estimate can be obtained
only through a measure of the magnitude of error in the estimates, called the
error of estimate.
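
A common measure of this kind is the standard error of estimate. The sketch below assumes the form that divides the sum of squared deviations by n - 2 (some texts divide by n), uses the regression line Yc = 8.61 + 0.71X from the example above, and applies it to hypothetical observed pairs that are not data from the text.

```python
import math


def estimate(x, a=8.61, b=0.71):
    """Estimated value Yc from the regression line Yc = a + bX used in the text."""
    return a + b * x


def standard_error_of_estimate(xs, ys):
    """Root of the summed squared deviations (Y - Yc), divided by n - 2."""
    squared_errors = [(y - estimate(x)) ** 2 for x, y in zip(xs, ys)]
    return math.sqrt(sum(squared_errors) / (len(xs) - 2))


# Hypothetical observed pairs (X, Y); they are not data from the text.
xs = [5, 10, 15, 20, 25]
ys = [12.0, 16.5, 19.0, 22.5, 26.8]

print("Yc at X = 15:", round(estimate(15), 2))
print("Standard error of estimate:", round(standard_error_of_estimate(xs, ys), 2))
```

If every observed point fell exactly on the line, the standard error would be zero, which is the perfect-prediction case described above.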


12.4 Theories of Business Forecasting: 


1. Theory of Economic Rhythm
2. Action and Reaction Approach
3. Sequence Method or Time Lag Method
4. Specific Historical Analogy
5. Cross-Cut Analysis
6. Model Building Approach


Business Forecasting: Theory # 1.


12.5 Theory of Economic Rhythm: 


This theory propounds that the economic phenomena behave in a rhythmic 
manner and cycles of nearly the same intensity and duration tend to recur. 
According to this theory, the available historical data have to be analysed 
into their components, i.e. trend, seasonal, cyclical and irregular variations. 

The secular trend obtained from the historical data is projected a number of
years into the future on a graph or with the help of mathematical trend
equations. If the phenomenon is cyclical in behaviour, the trend should be
adjusted for cyclical movements.

When the forecast for a year is to be split into months or quarters then the 
forecaster should adjust the projected figures for seasonal variations also 
with the help of seasonal indices. 
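
The adjustment for seasonal variations can be sketched as follows. The annual forecast of 1,200 units and the quarterly seasonal indices are hypothetical; the indices are assumed to be expressed as percentages that average 100.

```python
def split_by_season(annual_forecast, seasonal_indices):
    """Spread an annual forecast over seasons using indices that average 100."""
    periods = len(seasonal_indices)
    return [annual_forecast / periods * index / 100 for index in seasonal_indices]


# Hypothetical annual forecast and quarterly seasonal indices (average = 100).
quarterly = split_by_season(1200, [90, 110, 120, 80])
print([round(q, 1) for q in quarterly])
```

Because the indices average 100, the quarterly figures add back up to the annual forecast.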


It becomes difficult to predict irregular variations and hence, rhythm 
method should be used along with other methods to avoid inaccuracy in 
forecasts. However, it must be remembered that business cycles may not be 
strictly periodic and the very assumptions of this theory may not be true as 
history may not repeat. 


Business Forecasting: Theory # 2. 


12.6 Action and Reaction Approach:


This theory is based on Newton’s ‘Third Law of Motion’, i.e., for every
action there is an equal and opposite reaction. When we apply this law to
business, it implies that if there is a depression in a particular field of
business, there is bound to be a boom in it sooner or later. It reminds us of
the business cycle, which has four phases, i.e., prosperity, decline,
depression and recovery.

This theory regards a certain level of business activity as normal and the
forecaster has to estimate the normal level carefully. According to this
theory, if the price of a commodity goes beyond the normal level, it must
also come down below the normal level because of the increased
production and supply of that commodity.


Business Forecasting: Theory # 3. 


12.7 Sequence Method or Time Lag Method: 


This theory is based on the behaviour of different businesses which show 
similar movements occurring successively but not simultaneously. As such, 
this method takes into account time lag based on the theory of lead-lag 
relationship which holds good in most cases. 

The series that usually change earlier serve as forecast for other related 
series. However, the accuracy of forecasts under this method depends upon 
the accuracy with which time lag is estimated. 
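
One way to estimate the time lag, offered here only as a sketch, is to shift the lagging series by each candidate lag and keep the lag that gives the highest correlation with the leading series. Both series and the helper names `correlation` and `estimate_lag` are hypothetical.

```python
def correlation(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def estimate_lag(leader, follower, max_lag):
    """Return the lag (in periods) at which the two series correlate most strongly."""
    def score(lag):
        # Pair leader[t] with follower[t + lag].
        return correlation(leader[: len(leader) - lag], follower[lag:])
    return max(range(max_lag + 1), key=score)


# Hypothetical series: the follower repeats the leader's movements two periods later.
leader = [10, 12, 15, 11, 9, 14, 18, 13, 10, 16]
follower = [30, 28] + [2 * value + 5 for value in leader[:-2]]

print("Estimated time lag:", estimate_lag(leader, follower, max_lag=4))
```

Once the lag is known, the latest values of the leading series serve as an early indication of where the lagging series is heading.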


Business Forecasting: Theory # 4. 


12.8 Specific Historical Analogy: 


This theory is based on the assumption that history repeats itself. It simply 
implies that whatever happened in the past under a set of circumstances is 
likely to happen in future under the same set of conditions. 

Thus, a forecaster has to analyse the past data to select such period whose 
conditions are similar to the period of forecasting. Further, while predicting 
for the future, some adjustments may be made for the special 
circumstances which prevail at the time of making the forecasts. 


Business Forecasting: Theory # 5. 


12.9 Cross-Cut Analysis: 


In this method of business forecasting, the combined effect of various 
factors is not studied, but the effect of each factor, that has a bearing on the 
forecast, is studied independently. This theory is similar to the Analysis of 
Time Series under the statistical methods. 

Business Forecasting: Theory # 6. 


12.10 Model Building Approach: 


This approach makes use of mathematical equations for drawing economic 
models. These models depict the inter-relationships amongst the various 
factors affecting the economy or business. The expected values for 
dependent variables are then ascertained by putting the values of known 
variables in the model. This approach is highly mechanical and can
rarely be employed in business conditions.


12.11 Utility of Business Forecasting 


Meaning and Definition: Business forecasting is an act of predicting the
future economic conditions on the basis of past and present information. It
refers to the technique of taking a prospective view of things likely to
shape the turn of things in the foreseeable future. As the future is always
uncertain, there is a need for an organised system of forecasting in a
business. Thus, scientific business forecasting involves:

(i) Analysis of the past economic conditions; and

(ii) Analysis of the present economic conditions, so as to predict the future
course of events accurately.


In this regard, business forecasting refers to the analysis of the past and
present economic conditions with the object of drawing inferences about
the future business conditions. In the words of Allen, “Forecasting is a
systematic attempt to probe the future by inference from known facts. The
purpose is to provide management with information on which it can base
planning decisions.”


Leo Barnes observes, “Business Forecasting is the calculation of 
reasonable probabilities about the future, based on the analysis of all the 
latest relevant information by tested and logically sound statistical 
econometric techniques, as interpreted, modified and applied in terms of an 
executive’s personal judgment and social knowledge of his own business 
and his own industry or trade”. 

In the words of C.E. Sulton, “Business Forecasting is the calculation of 
probable events, to provide against the future. It therefore, involves a ‘look 
ahead’ in business and an idea of predetermination of events and their 
financial implications as in the case of budgeting.” 


According to John G. Glover, “Business Forecasting is the research 
procedure to discover those economic, social and financial influences 
governing business activity, so as to predict or estimate current and future 
trends or forces which may have a bearing on company policies or future 
financial, production and marketing operations.” 


The essence of all the above definitions is that business forecasting is a 
technique to analyse the economic, social and financial forces affecting the 
business with an object of predicting future events on the basis of past and 
present information. 

Steps of Forecasting: 

The process of forecasting consists of the following steps, also described as 
elements of forecasting: 

1. Developing the Basis: 

The first step involved in forecasting is developing the basis through a
systematic investigation of the economic situation and the position of the
industry and its products. The future estimates of sales and general business
operations have to be based on the results of such investigation. The general
economic forecast marks the primary step in the forecasting process.

2. Estimating Future Business Operations: 

The second step involves the estimation of conditions and course of future
events within the industry. On the basis of information/data collected
through investigation, future business operations are estimated. The
quantitative estimates for future scale of operations are made on the basis
of certain assumptions.

3. Regulating Forecasts: 

The forecasts are compared with actual results so as to determine any
deviations. The reasons for these variations are ascertained so that corrective
action can be taken in future.

4. Reviewing the Forecasting Process: 

Once the deviations in forecasts and actual performance are found then 
improvements can be made in the process of forecasting. The refining of 
forecasting process will improve forecasts in future. 

Sources of Data Used In Business Forecasting: 

Collection of data is the first step in any statistical investigation. It is the
basis for any analysis and interpretation. Before collection of data, many
questions will occupy the mind of the manager. The manager must be able
to answer these questions before the task of collection is started.


12.12 Limitations of Business Forecasting: 


In spite of its many advantages, some people regard business forecasting “as
an unnecessary mental gymnastics and reject it as a sheer waste of time, 
money and energy.” 


The reason for the same lies in the fact that despite all precautions, an 
element of error is bound to creep in the forecasts and we cannot eliminate 
guesswork in forecasts. It is also felt that forecasting is influenced by the 
pessimistic or optimistic attitude of the forecaster. 


It may not be possible to make forecasts with pin-point accuracy, but this
does not undermine the importance of business forecasting. The
management should first make use of statistical and econometric models in
making forecasts and then apply collective experience, skill and objective
judgement in evaluating the forecasts.


12.13 Business Forecasting: Advantage 


Business Forecasting: Advantage # 1. Establishing a New Business: 

While setting up a new business, a number of business forecasts are 
required. One has to forecast the demand for the product, capacity of 
competitors, expected share in the market, the amount and sources of 
raising finances, etc. 


The success of a new business will depend upon the accuracy of such 
forecasts. If the forecasts are made systematically, then the operations of 
the business will go smoothly and the chances of failure will be minimised. 


Business Forecasting: Advantage # 2. Formulating Plans: 

Forecasting provides a logical basis for preparing plans. It plays a major 
role in managerial planning and supplies the necessary information. The 
future assessment of various factors is essential for preparing plans. In fact, 
planning without forecasting is an impossibility. Henri Fayol has rightly
observed that the entire plan of an enterprise is made up of a series of plans 
called forecasts. 


Business Forecasting: Advantage # 3. Estimating Financial Needs: 


Every business needs adequate capital. In the absence of correct estimates 
of financial requirements, the business may suffer either from inadequate 
or from excess capital. Forecasting of sales and expenses helps in 
estimating future financial needs. 


The plans for expansion, diversification or improvement also necessitate 
the forecasting of requirements of funds. A proper financial planning 
depends upon systematic forecasting. 


Business Forecasting: Advantage # 4. Facilitating Managerial Decisions:

Forecasting helps management to take correct decisions. By providing a
logical basis for planning and determining in advance the nature of future 
business operations, it facilitates correct managerial decisions about 
material, personnel, sales and other requirements. 


Business Forecasting: Advantage # 5. Quality of Management: 

It improves the quality of managerial personnel by compelling them to 
look into the future and make provision for the same. By focussing 
attention on future, forecasting helps the management in adopting a 
definite course of action and a set purpose. 


Business Forecasting: Advantage # 6. Encourages Co-operation and Co-ordination:

Forecasting calls for some minimum effort on the part of all and, thus,
creates a sense of participation. It is not one person’s or one department’s
job. No department or person can make its forecasts in isolation. There
should be proper co-operation and co-ordination among different
departments for setting proper forecasts for the business as a whole.


So, forecasting process leads to better co-operation and co-ordination 
among people of various departments of the organisation. 


Business Forecasting: Advantage # 7. Better Utilisation of Resources:

Forecasting ensures better utilisation of resources by revealing the areas of
weaknesses and providing necessary information about the future. 
Management can concentrate on critical areas and control more effectively. 


Business Forecasting: Advantage # 8. Success in Business: 

Success in business, to a great extent, depends upon correct predictions 
about the future. Systematic forecasting ensures smooth and continuous 
working of the business. By knowing the future course of events in 
advance, one could always face the difficulties in a planned manner.


UNIT XIII - ANALYSIS OF TIME SERIES


13.0 INTRODUCTION


When quantitative data are arranged in the order of their
occurrence, the resulting statistical series is called a time series. The
quantitative values are usually recorded over equal time intervals: daily,
weekly, monthly, quarterly, half yearly, yearly, or any other time
measure. Monthly statistics of Industrial Production in India, annual
birth-rate figures for the entire world, yield on ordinary shares, weekly
wholesale price of rice, and daily records of tea sales or census data are
some examples of time series. Each has the common characteristic of
recording magnitudes that vary with the passage of time. In this unit we will
study time series analysis.


13.1 OBJECTIVES 


After going through this unit, you will 


• Learn about time series analysis
• Know about the measurement of trends
• Understand forecasting and deseasonalisation


13.2 TIME SERIES ANALYSIS 


Time series are influenced by a variety of forces. Some are continuously
effective, others make themselves felt at recurring time intervals, and still
others are non-recurring or random in nature. Therefore, the first task is
to break down the data and study each of these influences in isolation.
This is known as decomposition of the time series. It enables us to
understand fully the nature of the forces at work. We can then analyse
their combined interactions. Such a study is known as time-series
analysis.


13.2.1 COMPONENTS OF TIME SERIES: 


The factors that are responsible for bringing about changes in a 
time series, also called the components of time series, are as follows: 


• Secular Trends (or General Trends)
• Seasonal Movements
• Cyclical Movements
• Irregular Fluctuations


Secular Trends: 


Secular trend is the main component of a time series which results 
from long term effects of socio-economic and political factors. It shows 
the growth or decline in a time series over a long period. It is the type of 
tendency which continues to persist for a very long period. Prices and 
export and import data, for example, reflect obviously increasing 
tendencies over time. 


Seasonal Movements: 


Seasonal movements are short-term movements occurring in data due 
to seasonal factors. The short term is generally considered as a period in 
which changes occur in a time series with variations in weather or 
festivities. For example, the consumption of ice-cream is generally high 
during summer, so an ice-cream dealer's sales are higher in the summer 
months and relatively lower during the winter months. Employment, 
output, exports, etc., are subject to change due to variations in weather. 
Similarly, the sales of garments, umbrellas, greeting cards and fireworks 
are subject to large variations during festivals like Valentine’s Day, Eid, 
Christmas, New Year's, etc. These types of variations in a time series can 
be isolated only when the series is provided biannually, quarterly or 
monthly. 


Cyclical Movements: 


These are long-term oscillations occurring in a time series. These 
oscillations are mostly observed in economic data, and the periods of 
such oscillations generally extend from five to twelve years or 
more. These oscillations are associated with the well-known business 
cycles. Cyclical movements can be studied provided a long series of 
measurements, free from irregular fluctuations, is available. 


Irregular Fluctuations: 


These are sudden changes occurring in a time series which are unlikely 
to be repeated. They are the components of a time series which cannot 
be explained by trends, seasonal or cyclical movements. These variations 
are sometimes called residual or random components. Though accidental 
in nature, they can cause a continual change in the trend, seasonal and 
cyclical oscillations during the forthcoming period. Floods, fires, 
earthquakes, revolutions, epidemics, strikes, etc., are the root causes of 
such irregularities. 


13.2.2 ANALYSIS OF TIME SERIES 


The objective of the time series analysis is to identify the 
magnitude and direction of trends, to estimate the effect of seasonal and 
cyclical variations and to estimate the size of the residual component. 
This implies the decomposition of a time series into its several 
components. Two lines of approach are usually adopted in analyzing a 
given time series: 


• The additive model 
• The multiplicative model 


The additive model: 


It is used when the four components of a time series are 
independent of one another. Independent means the magnitude and 
patterns of movement of the components do not affect each other. Using 
this assumption the magnitudes of the time series are regarded as the sum 
of separate influences of its four components. In additive approach, the 
unit of measurements remains the same for all the four components. The 
additive model can be written as 


$Y = T + S + C + R$ 

where Y = magnitude of a time series, 
T = Trend, 
S = Seasonal component, 
C = Cyclical component, 
R = Random component. 


The multiplicative model: 


It is used where the forces giving rise to the four types of 
variations are interdependent. The magnitude of time series is the product 
of four components. Then the multiplicative model can be written as 


$Y = T \times S \times C \times R$ 


The additive model is usually used when the time series is spread 
over a short time span or where the rate of growth or decline in the trend 
is small. The multiplicative model, which is used more often than the 
additive model, is generally used whenever the time span of the series is 
large or the rate of growth or decline is large. A de-trended series may be 
obtained as 


$Y - T = S + C + R$ or $\frac{Y}{T} = S \times C \times R$ 

Similarly, a de-trended, de-seasonalized series may be obtained as 

$Y - T - S = C + R$ or $\frac{Y}{T \times S} = C \times R$ 
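
As a minimal sketch of the two de-trending rules, the Python snippet below removes a trend that is assumed to be already known, by subtraction under the additive model and by division under the multiplicative model. The figures in `y` and `trend` are hypothetical and serve only as an illustration; they are not taken from this unit.

```python
def detrend_additive(y, trend):
    """Return Y - T for each period (S + C + R under the additive model)."""
    return [yi - ti for yi, ti in zip(y, trend)]


def detrend_multiplicative(y, trend):
    """Return Y / T for each period (S x C x R under the multiplicative model)."""
    return [yi / ti for yi, ti in zip(y, trend)]


# Hypothetical quarterly observations and trend values, for illustration only.
y = [90.0, 110.0, 120.0, 100.0]
trend = [100.0, 102.0, 104.0, 106.0]

print(detrend_additive(y, trend))        # differences, in the units of the data
print(detrend_multiplicative(y, trend))  # ratios, free of units
```

The additive result stays in the units of the data, while the multiplicative result is a pure ratio, which is why seasonal indices under the multiplicative model are usually quoted as percentages.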


It is not always necessary for the time series to include all four 
types of variations; rather, one or more of these components might be 
missing altogether. For example, when using annual data the seasonal 
component may be ignored, while in a time series of a short span having 
monthly or quarterly observations, the cyclical component may be 
ignored. 


13.3 MEASUREMENT OF TRENDS 


The trend can be measured by the following methods: 

• Moving average method 
• Least square method 


13.3.1 MOVING AVERAGE METHOD 

Moving average method is a simple device for reducing 
fluctuations and obtaining trend values with a fair degree of accuracy. In 
this method the average value of a number of years (months, weeks, or 
days) is taken as the trend value for the middle point of the period of 
moving average. The process of averaging smooths the curve and 
reduces the fluctuations. 

The first thing to be decided in this method is the period of the 
moving average, that is, the number of consecutive items whose average 
is calculated each time. Suppose it has been decided that the period of 
the moving average is 5 years (months, weeks, or days). Then the 
arithmetic average of the first 5 items (numbers 1, 2, 3, 4 and 5) is placed 
against item No. 3, and the arithmetic average of items Nos. 2, 3, 4, 5 
and 6 is placed against item No. 4. This process is repeated till the 
arithmetic average of the last five items has been calculated. 

Odd Period of Moving Average 


Calculation of three yearly moving averages includes the following steps 


1. Add up the values of the first 3 years and place the sum 
against the median year. (This sum is called the moving total.) 

2. Leave the first year value, add up the values of the next three 
years and place the sum against its median year. 

3. This process must be continued till all the values of the data are 
taken for calculation. 

4. Each 3-yearly moving total must be divided by 3 to get the 3-year 
moving averages, which are our required trend values. 


The formula for calculating 3 yearly moving averages is as follows 

$\frac{a+b+c}{3}, \ \frac{b+c+d}{3}, \ \frac{c+d+e}{3}, \ \ldots$ 

The formula for calculating 5 yearly moving averages is as follows 

$\frac{a+b+c+d+e}{5}, \ \frac{b+c+d+e+f}{5}, \ \frac{c+d+e+f+g}{5}, \ \ldots$ 


Example: 


Calculate the 3 yearly and 5 yearly moving averages of the data 


Years | 1   | 2   | 3   | 4   | 5   | 6   | 7   | 8   | 9   | 10  | 11  | 12 
Sales | 5.2 | 4.9 | 5.5 | 4.9 | 5.2 | 5.7 | 5.4 | 5.8 | 5.9 | 6.0 | 5.2 | 4.8 


Solution: 

Year | Sales | 3 Year Moving | 3 Year Moving | 5 Year Moving | 5 Year Moving 
     |       | Total         | Average       | Total         | Average 
     |       |               | (Total / 3)   |               | (Total / 5) 
1    | 5.2   | ---           | ---           | ---           | --- 
2    | 4.9   | 15.6          | 5.20          | ---           | --- 
3    | 5.5   | 15.3          | 5.10          | 25.7          | 5.14 
4    | 4.9   | 15.6          | 5.20          | 26.2          | 5.24 
5    | 5.2   | 15.8          | 5.27          | 26.7          | 5.34 
6    | 5.7   | 16.3          | 5.43          | 27.0          | 5.40 
7    | 5.4   | 16.9          | 5.63          | 28.0          | 5.60 
8    | 5.8   | 17.1          | 5.70          | 28.8          | 5.76 
9    | 5.9   | 17.7          | 5.90          | 28.3          | 5.66 
10   | 6.0   | 17.1          | 5.70          | 27.7          | 5.54 
11   | 5.2   | 16.0          | 5.33          | ---           | --- 
12   | 4.8   | ---           | ---           | ---           | --- 
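
The odd-period steps can be sketched in Python as follows. This is an illustrative helper, not part of the original text; the function name `moving_average` is an assumption, and the data are the twelve sales figures of the example.

```python
def moving_average(values, period):
    """Return (moving_totals, moving_averages) for an odd period.

    Entry k belongs to the middle item of the k-th window, so a period-3
    result lines up with items 2, 3, ... of the original series.
    """
    if period % 2 == 0:
        raise ValueError("this sketch handles odd periods only")
    totals = [sum(values[i:i + period]) for i in range(len(values) - period + 1)]
    averages = [total / period for total in totals]
    return totals, averages


sales = [5.2, 4.9, 5.5, 4.9, 5.2, 5.7, 5.4, 5.8, 5.9, 6.0, 5.2, 4.8]

totals3, averages3 = moving_average(sales, 3)
totals5, averages5 = moving_average(sales, 5)

print([round(a, 2) for a in averages3])  # trend values for years 2 to 11
print([round(a, 2) for a in averages5])  # trend values for years 3 to 10
```

A period of 3 loses one value at each end of the series and a period of 5 loses two, which is the loss of trend values noted among the disadvantages of the method.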


Even Period of Moving Average: 


When the period of the moving average is 4, 6 or 8, it is an even number. 
The four yearly total cannot be placed against any year, as the median 2.5 
is between the second and the third year. So the total should be placed in 
between the 2nd and 3rd years. We must then centre the moving averages in 
order to place each moving average against a year. 


Steps to find even period of moving average: 


1. Add up the values of the first 4 years and place the sum against 
the middle of the 2nd and 3rd years. (This sum is called the 4 year 
moving total.) 

2. Leave the first year value and add the next 4 values from the 2nd year 
onward and write the sum against its middle position. 

3. This process must be continued till the value of the last item is 
taken into account. 

4. Add the first two 4-year moving totals and write the sum against 
the 3rd year. 

5. Leave the first 4-year moving total, add the next two 4-year 
moving totals and place the sum against the 4th year. 

6. This process must be continued till all the 4-yearly moving totals 
are summed up and centred. 

7. Divide each centred total (the sum of two 4-year moving totals) by 8 
to get the moving averages, which are our required trend values. 

Example: 


Find the 4 yearly moving average for determining trend values in the 
following time series data 


Year            | 2005 | 2006 | 2007 | 2008 | 2009 | 2010 | 2011 
Profit in (000) | 12   | 14   | 16   | 15   | 13   | 14   | 18 


Solution: 
Year | Profit | Sum of | 4 Yearly Moving | 4 Yearly Moving Average 
     |        | Fours  | Average         | Centred 
2005 | 12     |        |                 | 
2006 | 14     |        |                 | 
     |        | 57     | 14.25           | 
2007 | 16     |        |                 | (14.25 + 14.50) / 2 = 14.38 
     |        | 58     | 14.50           | 
2008 | 15     |        |                 | (14.50 + 14.50) / 2 = 14.50 
     |        | 58     | 14.50           | 
2009 | 13     |        |                 | (14.50 + 15.00) / 2 = 14.75 
     |        | 60     | 15.00           | 
2010 | 14     |        |                 | 
2011 | 18     |        |                 | 


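
The centring steps for an even period can be sketched in the same way. The function name `centred_moving_average` is an assumption made for this illustration; the profit figures are those of the example.

```python
def centred_moving_average(values, period=4):
    """Return centred moving averages for an even period.

    Entry k is the trend value for item k + period // 2 (0-based) of the
    original series; the first and last period // 2 items get no value.
    """
    if period % 2 != 0:
        raise ValueError("this sketch handles even periods only")
    totals = [sum(values[i:i + period]) for i in range(len(values) - period + 1)]
    # Two successive totals together cover 2 * period values, hence the divisor.
    return [(totals[i] + totals[i + 1]) / (2 * period) for i in range(len(totals) - 1)]


years = [2005, 2006, 2007, 2008, 2009, 2010, 2011]
profit = [12, 14, 16, 15, 13, 14, 18]

trend = centred_moving_average(profit, 4)
for year, value in zip(years[2:], trend):
    print(year, value)  # 2007 14.375, 2008 14.5, 2009 14.75
```

Dividing the sum of two 4-year totals by 8 gives the same result as averaging the two 4-year moving averages, so either route may be used.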
Advantages 


Moving averages can be used for measuring the trend of any 
series. This method is applicable to linear as well as non-linear trends. 


Disadvantages 


The trend obtained by moving averages generally is neither a 
straight line nor a standard curve. For this reason the trend cannot be 
extended for forecasting future values. Trend values are not available for 
some periods at the start and some values at the end of the time series. 
This method is not applicable to short time series. 


13.3.2 LEAST SQUARES METHOD 


When the trend is linear, the trend equation may be represented by 
y = a + bt. The values of a and b for the line y = a + bt which 
minimizes the sum of squares of the vertical deviations of the actual 
(observed) values from the straight line are the solutions to the so-called 
normal equations: 


$\sum y = na + b \sum t$ ................ (1) 
$\sum yt = a \sum t + b \sum t^2$ ................ (2) 

where n is the number of paired observations. 


The normal equations are obtained by multiplying y = a + bt by 
the coefficients of a and b, i.e., by 1 and t, throughout and summing up. 
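
The normal equations can also be reached with calculus. The following is a brief sketch rather than part of the original derivation; the symbol S for the sum of squared deviations is introduced here only for illustration.

```latex
S(a, b) = \sum (y - a - bt)^2

\frac{\partial S}{\partial a} = -2 \sum (y - a - bt) = 0
\quad\Longrightarrow\quad \sum y = na + b \sum t

\frac{\partial S}{\partial b} = -2 \sum t\,(y - a - bt) = 0
\quad\Longrightarrow\quad \sum yt = a \sum t + b \sum t^2
```

Setting both partial derivatives to zero gives the same two equations as multiplying y = a + bt by 1 and by t and summing.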


When the Number of Years is Odd 


We can use this method when we are given an odd number of years. It is 
easy and is widely used in practice. If the number of items is odd, we 
can follow these steps: 


1. Denote time as the t variable and the values as y. 
2. Take the middle year as the origin and find the deviations t of the 
other years from it. 
3. Square the time deviations and find $\sum t^2$. 
4. Multiply each value of y by the respective deviation t and 
find the total $\sum ty$. 
5. Add up the values of y to get $\sum y$. 
6. Place the values so obtained in the two equations 
   (i) $\sum y = na + b \sum t$ 
   (ii) $\sum yt = a \sum t + b \sum t^2$ 
and find the values of a and b. 
7. Substitute the calculated values of a and b in y = a + bt and find the 
trend values of y for the various values of t. 


When the number of years is odd, the calculation is simplified by 
taking the middle year as the origin and one year as the unit. In that case 


$\sum t = 0$ and the two normal equations take the form 

$\sum y = na$ ; $\sum yt = b \sum t^2$ 

Hence $a = \frac{\sum y}{n}$, $b = \frac{\sum yt}{\sum t^2}$ 


Example : 


Calculate trend values by the method of least squares from the data given 
below and estimate the sales for 2003. 


Years                    | 1996 | 1997 | 1998 | 1999 | 2000 
Sales of Co. A (₹ Lakhs) | 70   | 74   | 80   | 86   | 90 


Solution: 
Year  | Sales y        | Deviation from 1998, t | ty             | $t^2$ 
1996  | 70             | -2                     | -140           | 4 
1997  | 74             | -1                     | -74            | 1 
1998  | 80             | 0                      | 0              | 0 
1999  | 86             | 1                      | 86             | 1 
2000  | 90             | 2                      | 180            | 4 
n = 5 | $\sum y = 400$ | $\sum t = 0$           | $\sum ty = 52$ | $\sum t^2 = 10$ 

Since $\sum t = 0$, 

$a = \frac{\sum y}{n} = \frac{400}{5} = 80$, $b = \frac{\sum ty}{\sum t^2} = \frac{52}{10} = 5.2$ 


Hence, y = 80 + 5.2t 

Therefore $y_{1996} = 80 + 5.2(-2) = 80 - 10.4 = 69.6$ 
$y_{1997} = 80 + 5.2(-1) = 80 - 5.2 = 74.8$ 
$y_{1998} = 80 + 5.2(0) = 80 + 0 = 80$ 
$y_{1999} = 80 + 5.2(1) = 80 + 5.2 = 85.2$ 
$y_{2000} = 80 + 5.2(2) = 80 + 10.4 = 90.4$ 

For 2003, t will be 5. Putting t = 5 in the equation, 

$y_{2003} = 80 + 5.2(5) = 80 + 26 = 106$ 

Thus the estimated sales for the year 2003 is ₹ 106 lakhs. 
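
The same calculation can be sketched in Python. The helper name `fit_linear_trend` is an assumption for this illustration, and the sketch assumes consecutive, equally spaced years, so that the mean of the years is the mid-point used as origin.

```python
def fit_linear_trend(years, values):
    """Fit y = a + b*t by least squares, with t measured from the mid-point.

    Because the deviations t then sum to zero, the normal equations reduce
    to a = sum(y) / n and b = sum(t*y) / sum(t*t).
    """
    n = len(years)
    origin = sum(years) / n  # middle year, or the mid-way point for an even count
    t = [year - origin for year in years]
    a = sum(values) / n
    b = sum(ti * yi for ti, yi in zip(t, values)) / sum(ti * ti for ti in t)
    return origin, a, b


years = [1996, 1997, 1998, 1999, 2000]
sales = [70, 74, 80, 86, 90]

origin, a, b = fit_linear_trend(years, sales)
trend = [a + b * (year - origin) for year in years]
forecast_2003 = a + b * (2003 - origin)

print(a, round(b, 2))                # intercept and slope
print([round(v, 1) for v in trend])  # trend values for 1996 to 2000
print(round(forecast_2003, 1))       # estimate for 2003
```

Extending the fitted line beyond the data, as done for 2003, assumes that the linear trend continues unchanged.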


When the Number of Years is Even 


When the number of years is even, the origin is placed midway 
between the two middle years and the unit is taken to be half a year instead 
of one year. With this change of origin and scale we have 

$\sum t = 0$ 

Hence $a = \frac{\sum y}{n}$, $b = \frac{\sum yt}{\sum t^2}$ 

Example: 


Production of a company for 6 consecutive years is given in the 
following table. Calculate the trend values by using the method of least 
squares. 


Year       | 2000 | 2001 | 2002 | 2003 | 2004 | 2005 
Production | 12   | 13   | 18   | 20   | 24   | 28 

Solution: 

Year  | Production y   | Deviation from 2002.5, t | ty               | $t^2$             | Trend values 
2000  | 12             | -2.5                     | -30              | 6.25              | 10.95 
2001  | 13             | -1.5                     | -19.5            | 2.25              | 14.24 
2002  | 18             | -0.5                     | -9               | 0.25              | 17.52 
2003  | 20             | 0.5                      | 10               | 0.25              | 20.81 
2004  | 24             | 1.5                      | 36               | 2.25              | 24.10 
2005  | 28             | 2.5                      | 70               | 6.25              | 27.38 
n = 6 | $\sum y = 115$ | $\sum t = 0$             | $\sum ty = 57.5$ | $\sum t^2 = 17.5$ | 


Since $\sum t = 0$, 

$a = \frac{\sum y}{n} = \frac{115}{6} = 19.1667$, $b = \frac{\sum ty}{\sum t^2} = \frac{57.5}{17.5} = 3.2857$ 


Hence, y = 19.1667 + 3.2857t 

Therefore $y_{2000} = 19.1667 + 3.2857(-2.5) = 19.1667 - 8.2143 = 10.95$ 
$y_{2001} = 19.1667 + 3.2857(-1.5) = 19.1667 - 4.9286 = 14.24$ 
$y_{2002} = 19.1667 + 3.2857(-0.5) = 19.1667 - 1.6429 = 17.52$ 
$y_{2003} = 19.1667 + 3.2857(0.5) = 19.1667 + 1.6429 = 20.81$ 
$y_{2004} = 19.1667 + 3.2857(1.5) = 19.1667 + 4.9286 = 24.10$ 
$y_{2005} = 19.1667 + 3.2857(2.5) = 19.1667 + 8.2143 = 27.38$ 
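
The text takes half a year as the unit when the number of years is even, whereas the worked example keeps the deviations in years. The sketch below, which is an illustration and not part of the original solution, fits the trend with both codings. The helper name `trend_values` is an assumption.

```python
def trend_values(t, y):
    """Least-squares trend values for deviations t that sum to zero."""
    a = sum(y) / len(y)
    b = sum(ti * yi for ti, yi in zip(t, y)) / sum(ti * ti for ti in t)
    return a, b, [a + b * ti for ti in t]


production = [12, 13, 18, 20, 24, 28]

# Deviations from 2002.5 in units of one year, as in the worked example.
t_years = [-2.5, -1.5, -0.5, 0.5, 1.5, 2.5]
# The same deviations in units of half a year, which avoids fractions.
t_half_years = [-5, -3, -1, 1, 3, 5]

a1, b1, trend1 = trend_values(t_years, production)
a2, b2, trend2 = trend_values(t_half_years, production)

print(round(a1, 4), round(b1, 4), round(b2, 4))
print([round(v, 2) for v in trend1])
print([round(v, 2) for v in trend2])
```

Doubling every deviation halves the slope b and leaves a unchanged, so the two codings give the same trend values; the half-year unit is only a convenience for hand calculation.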

Merits 


1. The method is mathematically sound. 

2. The estimates a and b are unbiased. 

3. The least square method gives trend values for all the years and 
the method is devoid of all kinds of subjectivity. 

4. The algebraic sum of deviations of actual values from trend 
values is zero and the sum of the squares of the deviations is minimum. 


Demerits 


1. The least square method is highly mathematical; therefore, it is 
difficult for a layman to understand it. 

2. The method is not flexible. 

3. It has been assumed that y is only a linear function of the time period 
t. This may not be true in many situations. 


13.4 MEASUREMENT OF SEASONAL VARIATION 


Seasonal variations are the rhythmic changes in time series 
data, that is, regular and periodic variations having a period of one 
year. Examples that show seasonal variations are the production of cold 
drinks, which is high during the summer months and low during the 
winter season, and the sales of sarees in a cloth store, which are high 
during the festival season and low during other periods. Seasonal 
variations have their origin in climatic or institutional factors that affect 
either supply or demand or both. It is important that these variations be 
measured accurately. The reason for determining seasonal variations in a 
time series is to isolate them and to study their effect on the size of the 
variable in index form, which is usually referred to as the seasonal index. 


13.4.1 METHODS OF CONSTRUCTING SEASONAL INDICES 


There are four methods of constructing seasonal indices. 

1. Simple averages method 
2. Ratio to trend method 
3. Percentage moving average method 
4. Link relatives method 


Simple Average Method: 


The time series data for each of the 4 seasons (for quarterly data) 
of a particular year are expressed as percentages to the seasonal average 
for that year. The percentages for different seasons are averaged over the 
years by using simple average. The resulting percentages for each of the 
4 seasons then constitute the required seasonal indices. 


Steps to calculate Simple Average Method: 


(i) Arrange the data by months, quarters or years according to the data 
given. 

(ii) Find the sum for each month, quarter or year. 

(iii) Find the average for each month, quarter or year. 

(iv) Find the average of the averages; it is called the Grand Average (G). 

(v) Compute the Seasonal Index for every season (i.e., month, quarter or 
year), which is given by 

$\text{Seasonal Index (S.I)} = \frac{\text{Seasonal Average}}{\text{Grand Average}} \times 100$ 

If the data is given in months, 

$\text{Seasonal Index for Jan (S.I)} = \frac{\text{Monthly Average (for Jan)}}{\text{Grand Average}} \times 100$ 

$\text{Seasonal Index for Feb (S.I)} = \frac{\text{Monthly Average (for Feb)}}{\text{Grand Average}} \times 100$ 

Similarly we can calculate the S.I for all other months. 


Example: 


Calculate the seasonal index for the quarterly production of a computer 
using the method of simple averages. 


Year | I Quarter | II Quarter | III Quarter | IV Quarter 
2011 | 355       | 451        | 525         | 500 
2012 | 369       | 410        | 496         | 510 
2013 | 391       | 432        | 458         | 495 
2014 | 298       | 389        | 410         | 457 
2015 | 300       | 390        | 431         | 459 
2016 | 350       | 400        | 450         | 500 

Solution: 

Year               | I Quarter | II Quarter | III Quarter | IV Quarter 
2011               | 355       | 451        | 525         | 500 
2012               | 369       | 410        | 496         | 510 
2013               | 391       | 432        | 458         | 495 
2014               | 298       | 389        | 410         | 457 
2015               | 300       | 390        | 431         | 459 
2016               | 350       | 400        | 450         | 500 
Quarterly Total    | 2063      | 2472       | 2770        | 2921 
Quarterly Averages | 343.83    | 412        | 461.67      | 486.83 


$\text{Seasonal Index (S.I)} = \frac{\text{Seasonal Average}}{\text{Grand Average}} \times 100$ 

$\text{Grand Average} = \frac{343.83 + 412 + 461.67 + 486.83}{4} = \frac{1704.33}{4} = 426.0825$ 

$\text{S.I for I Q} = \frac{343.83}{426.0825} \times 100 = 80.70$ 

$\text{S.I for II Q} = \frac{412}{426.0825} \times 100 = 96.69$ 

$\text{S.I for III Q} = \frac{461.67}{426.0825} \times 100 = 108.35$ 

$\text{S.I for IV Q} = \frac{486.83}{426.0825} \times 100 = 114.26$ 
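
The steps of the simple average method can be sketched in Python as follows. The function name `seasonal_indices` is an assumption for this illustration; the data are the quarterly production figures of the example.

```python
production = {
    2011: [355, 451, 525, 500],
    2012: [369, 410, 496, 510],
    2013: [391, 432, 458, 495],
    2014: [298, 389, 410, 457],
    2015: [300, 390, 431, 459],
    2016: [350, 400, 450, 500],
}


def seasonal_indices(rows):
    """Simple average method: season average / grand average * 100."""
    seasons = list(zip(*rows))  # one tuple of yearly values per quarter
    season_averages = [sum(s) / len(s) for s in seasons]
    grand_average = sum(season_averages) / len(season_averages)
    return [avg / grand_average * 100 for avg in season_averages]


indices = seasonal_indices(list(production.values()))
print([round(i, 2) for i in indices])
print(round(sum(indices), 2))  # quarterly indices add up to 400
```

Because every index is a seasonal average divided by the average of those averages, the four quarterly indices always sum to 400, which is a quick check on hand calculations.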


Advantage and Disadvantage: 
• The method of simple averages is easy and simple to execute. 

• This method is based on the basic assumption that the data do not 
contain any trend or cyclic components. Since most economic and 
business time series have trends, this method, though simple, is not 
of much practical utility. 


CHECK YOUR PROGRESS - 1 
1. A time series is a set of data recorded 


2. The terms prosperity, recession, depression and recovery are in 
particular attached to 


3. What is time series? 


13.5 FORECASTING 


Time series forecasting methods produce forecasts based solely on 
historical values, and they are widely used in business situations where 
forecasts of a year or less are required. These methods are 
particularly suited to sales, marketing, finance, production planning, etc., 
and they have the advantage of relative simplicity. Time series 
forecasting is a technique for the prediction of events through a sequence 
of time. 


The technique is used across many fields of study, from geology to 
economics. It predicts future events by analyzing the trends 
of the past, on the assumption that future trends will be similar to 
historical trends. The data are organized around relatively deterministic 
timestamps and therefore, compared to random samples, may contain 
additional information that the analysis tries to extract. 


• Time series methods are better suited for short-term forecasts 
(i.e., less than a year). 

• Time series forecasting relies on sufficient past data being 
available and on the data being of high quality and truly 
representative. 

• Time series methods are best suited to relatively stable situations. 
Where substantial fluctuations are common and underlying 
conditions are subject to extreme change, time series 
methods may give relatively poor results. 

Advantages of forecasting: 

1. Helps to predict the future 

2. Helps to learn from the past 

3. Helps to remain competitive 

4. Helps to prepare for new business 


Disadvantages of forecasting: 

1. Basis of forecasting 

2. Reliability of past data 

3. Time and cost factor 


13.6 DESEASONALISATION 


When the seasonal component is removed from the original data, 
the reduced data are free from seasonal variations and are called 
deseasonalised data. That is, under a multiplicative model 

$\frac{T \times S \times C \times I}{S} = T \times C \times I$ 


Deseasonalised data, being free from the seasonal impact, manifest only 
the average value of the data. 


Seasonal adjustment can be made by dividing the original data by the 
seasonal index. 


$\text{Deseasonalised data} = \frac{\text{Original data}}{\text{Seasonal index}} \times 100$ 


where an adjustment-multiplier 100 is necessary because the seasonal 
indices are usually given in percentages. 
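
A minimal sketch of seasonal adjustment under the multiplicative model is given below. The function name `deseasonalise`, the quarterly figures and the seasonal indices are all hypothetical and are chosen only to illustrate the formula.

```python
def deseasonalise(values, seasonal_indices):
    """Divide each observation by its seasonal index (in percent), times 100.

    The indices are reused cyclically, so a list of 4 quarterly indices
    can be applied to several years of quarterly data.
    """
    k = len(seasonal_indices)
    return [v / seasonal_indices[i % k] * 100 for i, v in enumerate(values)]


# Hypothetical quarterly figures and seasonal indices, for illustration only.
quarterly_values = [320.0, 490.0, 540.0, 575.0]
seasonal_indices = [80.0, 98.0, 108.0, 115.0]

print(deseasonalise(quarterly_values, seasonal_indices))
```

In this made-up series the first quarter looks weakest in the raw data, but after adjustment it is the only quarter below the level of the others, which shows how removing the seasonal effect changes the comparison between quarters.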


In the case of the additive model 


$Y_t = T + S + C + I$ 


$\text{Deseasonalised data} = \text{Original data} - \frac{\text{Seasonal index}}{100} = Y_t - \frac{\text{Seasonal index}}{100}$ 

CHECK YOUR PROGRESS -2 


4. Define forecasting. 


5. What are the methods used for finding seasonal indices? 


13.7 SUMMARY 


• Time series are influenced by a variety of forces. Some are 
continuously effective, others make themselves felt at recurring 
time intervals, and still others are non-recurring or random in 
nature. Therefore, the first task is to break down the data and 
study each of these influences in isolation. This is known as 
decomposition of the time series. 

• The objective of time series analysis is to identify the 
magnitude and direction of trends, to estimate the effect of 
seasonal and cyclical variations and to estimate the size of the 
residual component. This implies the decomposition of a time 
series into its several components. Two lines of approach are 
usually adopted in analyzing a given time series: the additive 
model and the multiplicative model. 

• Moving average method is a simple device for reducing 
fluctuations and obtaining trend values with a fair degree of 
accuracy. In this method the average value of a number of years 
(months, weeks, or days) is taken as the trend value for the 
middle point of the period of moving average. The process of 
averaging smooths the curve and reduces the fluctuations. 

• When the trend is linear, the trend equation may be represented by 
y = a + bt, and the values of a and b for the line y = a + bt which 
minimizes the sum of squares of the vertical deviations of the 
actual (observed) values from the straight line are the solutions to 
the so-called normal equations. 

• Seasonal variations are the rhythmic changes in time series 
data, that is, regular and periodic variations having a period of one 
year. 

• There are four methods of constructing seasonal indices. They are 
the simple averages method, ratio to trend method, percentage 
moving average method and link relatives method. 

• Time series forecasting methods produce forecasts based solely 
on historical values and they are widely used in business 
situations where forecasts of a year or less are required. 

• When the seasonal component is removed from the original data, 
the reduced data are free from seasonal variations and are called 
deseasonalised data. 


13.8 KEY WORDS 


Time series, decomposition of the time series, additive model, 
multiplicative model, moving average method, least square method, seasonal 
variations, simple averages method, ratio to trend method, percentage moving 
average method, link relatives method, forecasting, deseasonalised. 


13.9 ANSWERS TO CHECK YOUR PROGRESS 


1. Periodically, at equal time intervals, at successive points of 
time 

2. Cyclical movements 

3. Time series are influenced by a variety of forces. Some are 
continuously effective, others make themselves felt at recurring 
time intervals, and still others are non-recurring or random in 
nature. 

4. Time series forecasting methods produce forecasts based solely 
on historical values and they are widely used in business 
situations where forecasts of a year or less are required. 

5. There are four methods of constructing seasonal indices. 
   1. Simple averages method 
   2. Ratio to trend method 
   3. Percentage moving average method 
   4. Link relatives method 


13.10 QUESTION AND EXERCISE 

SHORT ANSWER QUESTION 
1. What is time series? 
2. What are the uses of time series? 
3. What are the basic types of variations? 

LONG ANSWER QUESTION 
1. Explain the components of time series. 
2. What are the various methods of estimating the trend components? 
3. Explain the moving average method. How is it calculated? 
4. Describe the method of finding seasonal indices. 
5. Calculate the trend values by using three yearly moving averages of 
the following data 


Year       | 1990 | 1991 | 1992 | 1993 | 1994 | 1995 | 1996 | 1997 | 1998 | 1999 
Production | 21   | 22   | 23   | 25   | 24   | 22   | 25   | 26   | 27   | 26 


13.11 FURTHER READINGS 


Spiegel, Murray R.: Theory and Problems of Statistics, London: 
McGraw Hill Book Company. 


UNIT XIV - INDEX NUMBER 


Structure 

14.0 Introduction 

14.1 Objectives 

14.2 Index Numbers 
14.2.1 Types of Index Numbers 
14.2.2 Problems in construction of Index Numbers 
14.2.3 Methods of Constructing Index Numbers 
14.2.4 Quantity or Volume Index Numbers 
14.2.5 Test for Index Numbers 
14.2.6 Chain Base Index Numbers 


14.3 Cost of living Index Numbers 
14.3.1 Construction of cost of living Index Numbers 
14.3.2 Methods to construct cost of living Index Numbers 
14.3.3 Uses of cost of living Index Numbers 

14.4 Uses of Index Numbers 

14.5 Limitations of Index Numbers 

14.6 Summary 

14.7 Key Words 

14.8 Answers to Check Your Progress 

14.9 Questions and Exercise 

14.10 Further Readings 


14.0 INTRODUCTION 


Index numbers are a commonly used statistical device for 
measuring the combined fluctuations in a group of related variables. If we 
wish to compare the prices of consumer items today with their prices ten 
years ago, we are not interested in comparing the prices of only one item, 
but in comparing average price levels. We may wish to compare the 
present agricultural production or industrial production with that at the 
time of independence. Here again, we have to consider all items of 
production, and each item may have undergone a different fractional 
increase (or even a decrease). How do we obtain a composite measure? 
This composite measure is provided by index numbers, which may be 
defined as a device for combining the variations that have occurred in a 
group of related variables over a period of time, to obtain a figure that 
represents the ‘net’ result of the change in the constituent variables. In this 
unit you will learn in detail about index numbers. 


14.1 OBJECTIVES 


After going through this unit, you will 


• Understand index numbers and their types 
• Learn about the different methods of calculating index numbers 
• Know the uses and limitations of index numbers 


14.2 INDEX NUMBERS 


Index numbers are meant to study changes in the effects of factors 
which cannot be measured directly. According to Bowley, “Index 
numbers are used to measure the changes in some quantity which we 
cannot observe directly”. For example, changes in business activity in a 
country are not capable of direct measurement, but it is possible to study 
relative changes in business activity by studying the variations in the 
values of some such factors which affect business activity, and which are 
capable of direct measurement. 


Index numbers may be classified in terms of the variables that they 
are intended to measure. In business, the different groups of variables in 
the measurement of which index number techniques are commonly used are 
(i) price, (ii) quantity, (iii) value and (iv) business activity. Thus, we have 
an index of wholesale prices, index of consumer prices, index of 
industrial output, index of value of exports and index of business activity, 
etc. Here we shall be mainly interested in index numbers of prices 
showing changes with respect to time, although the methods described 
can be applied to other cases. In general, the present level of prices is 
compared with the level of prices in the past. The present period is called 
the current period and some period in the past is called the base period. 


14.2.1 TYPES OF INDEX NUMBERS 

Index numbers are named after the activity they measure. Their 
types are as under: 

Price Index: Measures changes in price over a specified period of time. It 
is basically the ratio of the price of a certain number of commodities in 
the present year to that in the base year. 

Quantity Index: As the name suggests, these indices pertain to 
measuring changes in volumes of commodities like goods produced or 
goods consumed, etc. 

Value Index: These indices compare changes in the monetary value of 
imports, exports, production or consumption of commodities. 


14.2.2 PROBLEMS IN THE CONSTRUCTION OF INDEX 
NUMBERS 

Decisions regarding the following problems/aspects have to be 
taken before starting the actual construction of any type of index 
number. 

(i) Purpose of Index numbers under construction 

(ii) Selection of base period 

(iii) Selection of items 

(iv) Selection of source data 

(v) Collection of data 

(vi) Selection of average 

(vii) System of weighting 


14.2.3 METHODS OF CONSTRUCTING INDEX NUMBERS 
For this purpose, index numbers are divided into two types: 


(1) Unweighted Index number 

• Simple aggregative 

• Simple Average of price relatives 

(2) Weighted Index number 

• Weighted aggregative 

• Weighted Average of price relatives 


Unweighted Index number: 


There are two methods of constructing unweighted index 
numbers: (1) Simple Aggregative Method (2) Simple Average of 
Relative Method 


Simple Aggregative Method 


In this method, the total price of commodities in a given (current) 
year is divided by the total price of commodities in a base year and 
expressed as percentage: 


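$P_{01} = \frac{\sum p_1}{\sum p_0} \times 100$ 

where $p_1$ and $p_0$ denote the prices in the current year and the base year. 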


Simple Average of Relative Method 


In this method, we compute price relatives or link relatives of the 
given commodities and then use one of the averages such as the 
arithmetic mean, geometric mean, median, etc. If we use the arithmetic 
mean as the average, then: 


$P_{01} = \frac{1}{n} \sum \left( \frac{p_1}{p_0} \times 100 \right)$ 


The simple average of relative method is simpler and easier to apply than 
the simple aggregative method. The only disadvantage is that it gives 
equal weight to all items. 


Example : 


The following are the prices of four different commodities for 
2017 and 2018. Compute a price index with the (1) simple aggregative 
method and (2) average of price relative method by using both the 
arithmetic mean and geometric mean, taking 2017 as the base. 


Commodity     | Cotton | Wheat | Rice | Grams 
Price in 2017 | 909    | 288   | 767  | 659 
Price in 2018 | 874    | 305   | 910  | 573 

Solution: 

Commodity | Price in 2017     | Price in 2018     | Price relative                      | log P 
          | $p_0$             | $p_1$             | $P = \frac{p_1}{p_0} \times 100$    | 
Cotton    | 909               | 874               | 96.15                               | 1.9829 
Wheat     | 288               | 305               | 105.90                              | 2.0249 
Rice      | 767               | 910               | 118.64                              | 2.0742 
Grams     | 659               | 573               | 86.95                               | 1.9393 
Total     | $\sum p_0 = 2623$ | $\sum p_1 = 2662$ | $\sum P = 407.64$                   | $\sum \log P = 8.0213$ 

1. Simple Aggregative Method 

$P_{01} = \frac{\sum p_1}{\sum p_0} \times 100 = \frac{2662}{2623} \times 100 = 101.49$ 

2. Simple Average of Price Relative Method (using the arithmetic mean) 

$P_{01} = \frac{\sum P}{n} = \frac{407.64}{4} = 101.91$ 

3. Average of price relative method (using the geometric mean) 

$P_{01} = \text{antilog}\left( \frac{\sum \log P}{n} \right) = \text{antilog}\left( \frac{8.0213}{4} \right) = 101.23$ 
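
The three unweighted indices can be checked with a short Python sketch. The function names are assumptions made for this illustration. Because the code uses full-precision logarithms rather than four-figure log tables, the geometric-mean result can differ from the hand calculation in the last decimal place.

```python
import math

base_prices = {"Cotton": 909, "Wheat": 288, "Rice": 767, "Grams": 659}     # 2017
current_prices = {"Cotton": 874, "Wheat": 305, "Rice": 910, "Grams": 573}  # 2018


def simple_aggregative(p0, p1):
    """Total of current prices over total of base prices, as a percentage."""
    return sum(p1.values()) / sum(p0.values()) * 100


def price_relatives(p0, p1):
    """Price relative (p1 / p0) * 100 for each commodity."""
    return [p1[item] / p0[item] * 100 for item in p0]


def arithmetic_mean_index(p0, p1):
    relatives = price_relatives(p0, p1)
    return sum(relatives) / len(relatives)


def geometric_mean_index(p0, p1):
    relatives = price_relatives(p0, p1)
    mean_log = sum(math.log10(r) for r in relatives) / len(relatives)
    return 10 ** mean_log  # antilog of the mean logarithm


print(round(simple_aggregative(base_prices, current_prices), 2))
print(round(arithmetic_mean_index(base_prices, current_prices), 2))
print(round(geometric_mean_index(base_prices, current_prices), 2))
```

The geometric mean of the relatives is never larger than their arithmetic mean, so the third index is the smallest of the two averages of relatives here.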


Weighted Index number: 


When all commodities are not of equal importance, we assign 
weight to each commodity relative to its importance and the index 
number computed from these weights is called a weighted index number. 


Weighted aggregative index number: 


In order to attribute appropriate importance to each of the items 
used in an aggregate index number some reasonable weights must be 
used. There are various methods of assigning weights and consequently 
a large number of formulae for constructing index numbers have been 
devised of which some of the most important ones are: 


1. Laspeyre’s Index Number 

2. Paasche’s Index Number 

3. Fisher’s Ideal Index Number 

4. Marshall-Edgeworth Index Number 


Laspeyre’s Index Number: 


In this index number the base year quantities are used as weights, 
so it is also called the base year weighted index. 


$P_{01} = \frac{\sum p_1 q_0}{\sum p_0 q_0} \times 100$ 


Paasche’s Index Number: 


In this index number the current (given) year quantities are used 
as weights, so it is also called the current year weighted index. 


$P_{01} = \frac{\sum p_1 q_1}{\sum p_0 q_1} \times 100$ 


Fisher’s Ideal Index Number: 


The geometric mean of Laspeyre’s and Paasche’s index numbers 
is known as Fisher’s ideal index number. It is called ideal because it 
satisfies the time reversal and factor reversal tests. 


$P_{01} = \sqrt{\text{Laspeyre’s Index} \times \text{Paasche’s Index}}$ 


$P_{01} = \sqrt{\frac{\sum p_1 q_0}{\sum p_0 q_0} \times \frac{\sum p_1 q_1}{\sum p_0 q_1}} \times 100$ 


Marshall-Edgeworth Index Number: 


In this index number the average of the base year and current year 
quantities is used as weights. This index number was proposed by two 
English economists, Marshall and Edgeworth. 


$P_{01} = \frac{\sum p_1 q_0 + \sum p_1 q_1}{\sum p_0 q_0 + \sum p_0 q_1} \times 100 = \frac{\sum p_1 (q_0 + q_1)}{\sum p_0 (q_0 + q_1)} \times 100$ 


Example: 


ILI EES Compute the weighted aggregative price index numbers for 2011 
NOTES with 2010 as the base year using (1) Laspeyre’s Index Number (2) 
Paasche’s Index Number (3) Fisher’s Ideal Index Number (4) Marshal- 
Edgeworth Index Number. 
Commodity | Price 2010 | Price 2011 | Quantity 2010 | Quantity 2011 
A         | 10         | 12         | 20            | 22 
B         | 8          | 8          | 16            | 18 
C         | 5          | 6          | 10            | 11 
D         | 4          | 4          | 7             | 8 
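Solution: 

Commodity | $p_0$ (2010) | $p_1$ (2011) | $q_0$ (2010) | $q_1$ (2011) | $p_1 q_0$ | $p_0 q_0$ | $p_1 q_1$ | $p_0 q_1$ 
A         | 10 | 12 | 20 | 22 | 240 | 200 | 264 | 220 
B         | 8  | 8  | 16 | 18 | 128 | 128 | 144 | 144 
C         | 5  | 6  | 10 | 11 | 60  | 50  | 66  | 55 
D         | 4  | 4  | 7  | 8  | 28  | 28  | 32  | 32 
Total     |    |    |    |    | $\sum p_1 q_0 = 456$ | $\sum p_0 q_0 = 406$ | $\sum p_1 q_1 = 506$ | $\sum p_0 q_1 = 451$ 
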


Laspeyre’s Index Number: 

$$P_{01} = \frac{\sum p_1 q_0}{\sum p_0 q_0} \times 100 = \frac{456}{406} \times 100 = 112.32$$

Paasche’s Index Number: 

$$P_{01} = \frac{\sum p_1 q_1}{\sum p_0 q_1} \times 100 = \frac{506}{451} \times 100 = 112.20$$

Fisher’s Ideal Index Number: 

$$P_{01} = \sqrt{\text{Laspeyre’s Index} \times \text{Paasche’s Index}} = \sqrt{112.32 \times 112.20} = 112.26$$

Marshall-Edgeworth Index Number: 

$$P_{01} = \frac{\sum p_1 (q_0 + q_1)}{\sum p_0 (q_0 + q_1)} \times 100 = \frac{456 + 506}{406 + 451} \times 100 = \frac{962}{857} \times 100 = 112.25$$
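
The four weighted aggregative formulae can be sketched in Python as below. This is a minimal illustration, not a standard library routine: the function names and the tuple layout `(p0, p1, q0, q1)` are assumptions made here, and the data are the four commodities of this example.

```python
import math

# commodity -> (p0, p1, q0, q1); 2010 is the base year, 2011 the current year
data = {
    "A": (10, 12, 20, 22),
    "B": (8, 8, 16, 18),
    "C": (5, 6, 10, 11),
    "D": (4, 4, 7, 8),
}

def aggregates(data):
    """Return the four sums: sum p1*q0, sum p0*q0, sum p1*q1, sum p0*q1."""
    rows = list(data.values())
    p1q0 = sum(p1 * q0 for p0, p1, q0, q1 in rows)
    p0q0 = sum(p0 * q0 for p0, p1, q0, q1 in rows)
    p1q1 = sum(p1 * q1 for p0, p1, q0, q1 in rows)
    p0q1 = sum(p0 * q1 for p0, p1, q0, q1 in rows)
    return p1q0, p0q0, p1q1, p0q1

def laspeyres(data):
    """Base-year quantities as weights."""
    p1q0, p0q0, _, _ = aggregates(data)
    return p1q0 / p0q0 * 100

def paasche(data):
    """Current-year quantities as weights."""
    _, _, p1q1, p0q1 = aggregates(data)
    return p1q1 / p0q1 * 100

def fisher(data):
    """Geometric mean of the Laspeyres and Paasche indices."""
    return math.sqrt(laspeyres(data) * paasche(data))

def marshall_edgeworth(data):
    """Sum of base-year and current-year quantities as weights."""
    p1q0, p0q0, p1q1, p0q1 = aggregates(data)
    return (p1q0 + p1q1) / (p0q0 + p0q1) * 100

for name, index in [("Laspeyres", laspeyres), ("Paasche", paasche),
                    ("Fisher", fisher), ("Marshall-Edgeworth", marshall_edgeworth)]:
    print(name, round(index(data), 2))
```

Keeping the four sums in one helper mirrors the hand calculation, where the columns of the table are totalled once and reused in every formula.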


Weighted average of price relatives: 


When specific weights are given for each commodity, the 
weighted index number is calculated by 

$$\text{Weighted Average of Price Relatives index} = \frac{\sum P w}{\sum w}$$

where $w$ = the weight of the commodity and 
$P$ = the price relative $= \frac{p_1}{p_0} \times 100$. 

When the base year value $p_0 q_0$ is taken as the weight, i.e. $w = p_0 q_0$, 
the formula becomes 

$$\text{Weighted Average of Price Relatives index} = \frac{\sum \left(\frac{p_1}{p_0} \times 100\right) p_0 q_0}{\sum p_0 q_0} = \frac{\sum p_1 q_0}{\sum p_0 q_0} \times 100$$

This is nothing but Laspeyre’s formula. 

When the weight is taken as $w = p_0 q_1$, the formula becomes 

$$\text{Weighted Average of Price Relatives index} = \frac{\sum \left(\frac{p_1}{p_0} \times 100\right) p_0 q_1}{\sum p_0 q_1} = \frac{\sum p_1 q_1}{\sum p_0 q_1} \times 100$$

This is nothing but Paasche’s formula. 
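
The two reductions can be verified numerically. The sketch below is illustrative only: the helper name `weighted_average_of_relatives` is an assumption, and the data are the commodities A to D of the earlier weighted aggregative example.

```python
# (p0, p1, q0, q1) for commodities A to D of the earlier example
rows = [(10, 12, 20, 22), (8, 8, 16, 18), (5, 6, 10, 11), (4, 4, 7, 8)]

def weighted_average_of_relatives(relatives, weights):
    """Sum of (price relative x weight) divided by the sum of the weights."""
    return sum(r * w for r, w in zip(relatives, weights)) / sum(weights)

relatives = [p1 / p0 * 100 for p0, p1, q0, q1 in rows]

# Weights w = p0*q0 (base-year values) should reproduce Laspeyres' index.
base_value_weights = [p0 * q0 for p0, p1, q0, q1 in rows]
laspeyres = (sum(p1 * q0 for p0, p1, q0, q1 in rows)
             / sum(p0 * q0 for p0, p1, q0, q1 in rows) * 100)

# Weights w = p0*q1 should reproduce Paasche's index.
mixed_value_weights = [p0 * q1 for p0, p1, q0, q1 in rows]
paasche = (sum(p1 * q1 for p0, p1, q0, q1 in rows)
           / sum(p0 * q1 for p0, p1, q0, q1 in rows) * 100)

from_base_weights = weighted_average_of_relatives(relatives, base_value_weights)
from_mixed_weights = weighted_average_of_relatives(relatives, mixed_value_weights)

print(round(from_base_weights, 2), round(laspeyres, 2))
print(round(from_mixed_weights, 2), round(paasche, 2))
```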


Example: Compute the weighted index number for the following data. 

Commodity | Current year price | Base year price | Weight 
X         | 5                  | 4               | 40 
Y         | 3                  | 2               | 60 
Z         | 2                  | 1               | 20 

Solution: 

Commodity | Current year price ($p_1$) | Base year price ($p_0$) | Weight ($w$) | $P = \frac{p_1}{p_0} \times 100$ | $Pw$ 
X         | 5 | 4 | 40  | 125 | 5000 
Y         | 3 | 2 | 60  | 150 | 9000 
Z         | 2 | 1 | 20  | 200 | 4000 
Total     |   |   | 120 |     | 18000 

$$\text{Weighted Average of Price Relatives index} = \frac{\sum P w}{\sum w} = \frac{18000}{120} = 150$$


14.2.4 QUANTITY OR VOLUME INDEX NUMBER 


Price index numbers measure and permit comparison of the prices 
of certain goods; quantity index numbers, on the other hand, measure the 
physical volume of production, construction or employment. Though 
price indices are more widely used, production indices are highly 
significant as indicators of the level of output in the economy or in parts 
of it. 


In constructing quantity index numbers, the problems confronting 
the statistician are analogous to those involved in price indices. We 
measure changes in quantities, and when we weigh we use prices or 
values as weights. Quantity indices can be obtained easily by changing p 
to q and q to p in the various formulae discussed above. 


Thus, when Laspeyre’s method is used 

$$Q_{01} = \frac{\sum q_1 p_0}{\sum q_0 p_0} \times 100$$

When Paasche’s formula is used 

$$Q_{01} = \frac{\sum q_1 p_1}{\sum q_0 p_1} \times 100$$

When Fisher’s formula is used 

$$Q_{01} = \sqrt{\frac{\sum q_1 p_0}{\sum q_0 p_0} \times \frac{\sum q_1 p_1}{\sum q_0 p_1}} \times 100$$

These formulae represent the quantity index in which the 
quantities of the different commodities are weighted by their prices. 


Example: 


Compute the following quantity indices from the data given 
below: (1) Laspeyre’s Index Number (2) Paasche’s Index Number (3) 
Fisher’s Ideal Index Number. 

Commodity | 2002 Price | 2002 Total value | 2012 Price | 2012 Total value 
A         | 10         | 200              | 12         | 360 
B         | 12         | 480              | 15         | 900 
C         | 15         | 450              | 17         | 680 
Solution: 

Here total values are given instead of quantities, so first find the 
quantities of the base year and the current year: 

$$\text{Quantity} = \frac{\text{Total value}}{\text{Price}}$$

Commodity | $p_0$ | $q_0$ | $p_1$ | $q_1$ | $p_0 q_0$ | $p_0 q_1$ | $p_1 q_0$ | $p_1 q_1$ 
A         | 10 | 20 | 12 | 30 | 200  | 300  | 240  | 360 
B         | 12 | 40 | 15 | 60 | 480  | 720  | 600  | 900 
C         | 15 | 30 | 17 | 40 | 450  | 600  | 510  | 680 
Total     |    |    |    |    | 1130 | 1620 | 1350 | 1940 

Laspeyre’s method: 

$$Q_{01} = \frac{\sum q_1 p_0}{\sum q_0 p_0} \times 100 = \frac{1620}{1130} \times 100 = 143.36$$

Paasche’s formula: 

$$Q_{01} = \frac{\sum q_1 p_1}{\sum q_0 p_1} \times 100 = \frac{1940}{1350} \times 100 = 143.70$$

Fisher’s formula: 

$$Q_{01} = \sqrt{\frac{\sum q_1 p_0}{\sum q_0 p_0} \times \frac{\sum q_1 p_1}{\sum q_0 p_1}} \times 100 = \sqrt{\frac{1620}{1130} \times \frac{1940}{1350}} \times 100 = 143.53$$
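
The step of recovering quantities from total values, followed by the three quantity indices, can be sketched in Python. This is a hedged illustration: the names `to_quantities` and `quantity_indices` are assumptions introduced here, and the input is the price and total-value data as stated in the problem.

```python
import math

# commodity -> (price 2002, total value 2002, price 2012, total value 2012)
values = {
    "A": (10, 200, 12, 360),
    "B": (12, 480, 15, 900),
    "C": (15, 450, 17, 680),
}

def to_quantities(values):
    """Recover quantities as total value / price; returns (p0, q0, p1, q1)."""
    return {
        name: (p0, v0 / p0, p1, v1 / p1)
        for name, (p0, v0, p1, v1) in values.items()
    }

def quantity_indices(rows):
    """Laspeyres, Paasche and Fisher quantity indices (prices act as weights)."""
    q1p0 = sum(q1 * p0 for p0, q0, p1, q1 in rows.values())
    q0p0 = sum(q0 * p0 for p0, q0, p1, q1 in rows.values())
    q1p1 = sum(q1 * p1 for p0, q0, p1, q1 in rows.values())
    q0p1 = sum(q0 * p1 for p0, q0, p1, q1 in rows.values())
    laspeyres = q1p0 / q0p0 * 100
    paasche = q1p1 / q0p1 * 100
    fisher = math.sqrt(laspeyres * paasche)
    return laspeyres, paasche, fisher

rows = to_quantities(values)
laspeyres_q, paasche_q, fisher_q = quantity_indices(rows)
print(rows)
print(round(laspeyres_q, 2), round(paasche_q, 2), round(fisher_q, 2))
```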
14.2.5 TEST FOR INDEX NUMBER 


There are certain tests which are applied to verify the consistency, or 
adequacy, of an index number formula from different points of view. The 
most popular among these are the following tests: 

• Order reversal test. 
• Time reversal test. 
• Factor reversal test. 
• Unit test. 


At the outset, it should be noted that it is neither possible nor 
necessary for an index-number formula to satisfy all the tests mentioned 
above. But, an ideal formula should be such that it satisfies the maximum 
possible tests which are relevant to the matter under study. 

1. Order reversal test: 

This test requires that a formula of Index number should be such 
that the value of the index number remains the same, even if, the order of 
arrangement of the items is reversed, or altered. As a matter of fact, this 
test is satisfied by all the formulas of index number explained in this 
chapter. 


2. Time reversal test: 


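The time reversal test requires that the index number computed 
backwards should be the reciprocal of the index number computed 
forward, except for the constant of proportionality. Ignoring the 
factor 100, the condition is $P_{01} \times P_{10} = 1$. For Fisher’s ideal index, 

$$P_{01} = \sqrt{\frac{\sum p_1 q_0}{\sum p_0 q_0} \times \frac{\sum p_1 q_1}{\sum p_0 q_1}}$$

$$P_{10} = \sqrt{\frac{\sum p_0 q_1}{\sum p_1 q_1} \times \frac{\sum p_0 q_0}{\sum p_1 q_0}}$$

$$P_{01} \times P_{10} = \sqrt{\frac{\sum p_1 q_0}{\sum p_0 q_0} \times \frac{\sum p_1 q_1}{\sum p_0 q_1} \times \frac{\sum p_0 q_1}{\sum p_1 q_1} \times \frac{\sum p_0 q_0}{\sum p_1 q_0}} = \sqrt{1} = 1$$

Laspeyre’s and Paasche’s methods do not satisfy this test, but 
Fisher’s ideal index does. Besides, both the simple and the 
weighted geometric means of price relatives also satisfy the time 
reversal test. 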

3. Factor reversal test: 

This test has also been put forth by Prof. Irving Fisher. In this test 
the product of the price index and the quantity index must be equal to the 
value index. Thus, for the factor reversal test, a formula of index number 
should satisfy the following equation: 

Price index x Quantity index = Value index 

For Fisher’s ideal index, 

$$P_{01} = \sqrt{\frac{\sum p_1 q_0}{\sum p_0 q_0} \times \frac{\sum p_1 q_1}{\sum p_0 q_1}}$$

$$Q_{01} = \sqrt{\frac{\sum q_1 p_0}{\sum q_0 p_0} \times \frac{\sum q_1 p_1}{\sum q_0 p_1}}$$

$$P_{01} \times Q_{01} = \sqrt{\frac{\left(\sum p_1 q_1\right)^2}{\left(\sum p_0 q_0\right)^2}} = \frac{\sum p_1 q_1}{\sum p_0 q_0}$$

Most of the formulae of index numbers discussed above fail to 
satisfy this acid test of consistency, except that of Prof. Irving Fisher. 


4. Unit test: 

This test requires that the formula for constructing an index 
should be independent of the units of measurement in which the prices 
and quantities are quoted. Except for the unweighted (simple) aggregative 
index number, all the other formulae in this chapter satisfy this test. 
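
The unit test can be illustrated with a small Python sketch. It is only a demonstration under stated assumptions: `change_unit` is a hypothetical helper that quotes one commodity in a unit ten times larger (so its prices rise tenfold and its quantities fall tenfold), and the data are commodities A to D of the earlier weighted aggregative example.

```python
# (p0, p1, q0, q1) for commodities A to D of the earlier example
items = [(10, 12, 20, 22), (8, 8, 16, 18), (5, 6, 10, 11), (4, 4, 7, 8)]

def simple_aggregative(items):
    """Unweighted aggregative index: sum of p1 over sum of p0, times 100."""
    return sum(p1 for p0, p1, q0, q1 in items) / sum(p0 for p0, p1, q0, q1 in items) * 100

def laspeyres(items):
    """Base-year weighted aggregative index."""
    return (sum(p1 * q0 for p0, p1, q0, q1 in items)
            / sum(p0 * q0 for p0, p1, q0, q1 in items) * 100)

def change_unit(item, factor):
    """Quote a commodity in a unit `factor` times larger (hypothetical helper)."""
    p0, p1, q0, q1 = item
    return (p0 * factor, p1 * factor, q0 / factor, q1 / factor)

# Commodity A is re-quoted in a unit ten times larger; the others are unchanged.
rescaled = [change_unit(items[0], 10)] + items[1:]

print(round(simple_aggregative(items), 2), round(simple_aggregative(rescaled), 2))
print(round(laspeyres(items), 2), round(laspeyres(rescaled), 2))
```

The simple aggregative index moves when the unit changes because it adds raw prices, whereas the weighted form adds price times quantity, and that product does not depend on the unit.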


Example: 

Construct Fisher’s ideal index for the following data. Test whether it 
satisfies the time reversal test and the factor reversal test. 

Commodity | Base year quantity | Base year price | Current year quantity | Current year price 
A         | 24                 | 20              | 30                    | 24 
B         | 30                 | 14              | 40                    | 10 
C         | 10                 | 10              | 16                    | 18 

Solution: 

Commodity | $q_0$ | $p_0$ | $q_1$ | $p_1$ | $p_0 q_0$ | $p_0 q_1$ | $p_1 q_0$ | $p_1 q_1$ 
A         | 24 | 20 | 30 | 24 | 480  | 600  | 576  | 720 
B         | 30 | 14 | 40 | 10 | 420  | 560  | 300  | 400 
C         | 10 | 10 | 16 | 18 | 100  | 160  | 180  | 288 
Total     |    |    |    |    | 1000 | 1320 | 1056 | 1408 

Fisher’s ideal index number: 

$$P_{01} = \sqrt{\frac{\sum p_1 q_0}{\sum p_0 q_0} \times \frac{\sum p_1 q_1}{\sum p_0 q_1}} \times 100 = \sqrt{\frac{1056}{1000} \times \frac{1408}{1320}} \times 100$$

$$= \sqrt{1.056 \times 1.0667} \times 100 = \sqrt{1.1264} \times 100 = 1.0613 \times 100 = 106.13$$


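Time Reversal test: 

The time reversal test is satisfied when $P_{01} \times P_{10} = 1$. 

$$P_{01} = \sqrt{\frac{\sum p_1 q_0}{\sum p_0 q_0} \times \frac{\sum p_1 q_1}{\sum p_0 q_1}} = \sqrt{\frac{1056}{1000} \times \frac{1408}{1320}}$$

$$P_{10} = \sqrt{\frac{\sum p_0 q_1}{\sum p_1 q_1} \times \frac{\sum p_0 q_0}{\sum p_1 q_0}} = \sqrt{\frac{1320}{1408} \times \frac{1000}{1056}}$$

$$P_{01} \times P_{10} = \sqrt{\frac{1056}{1000} \times \frac{1408}{1320} \times \frac{1320}{1408} \times \frac{1000}{1056}} = \sqrt{1} = 1$$

Hence Fisher’s ideal index satisfies the time reversal test. 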


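Factor Reversal Test: 

The factor reversal test is satisfied when $P_{01} \times Q_{01} = \frac{\sum p_1 q_1}{\sum p_0 q_0}$. 

$$P_{01} = \sqrt{\frac{\sum p_1 q_0}{\sum p_0 q_0} \times \frac{\sum p_1 q_1}{\sum p_0 q_1}} = \sqrt{\frac{1056}{1000} \times \frac{1408}{1320}}$$

$$Q_{01} = \sqrt{\frac{\sum q_1 p_0}{\sum q_0 p_0} \times \frac{\sum q_1 p_1}{\sum q_0 p_1}} = \sqrt{\frac{1320}{1000} \times \frac{1408}{1056}}$$

$$P_{01} \times Q_{01} = \sqrt{\frac{1056}{1000} \times \frac{1408}{1320} \times \frac{1320}{1000} \times \frac{1408}{1056}} = \sqrt{\frac{1408^2}{1000^2}} = \frac{1408}{1000} = \frac{\sum p_1 q_1}{\sum p_0 q_0}$$

Hence Fisher’s ideal index number satisfies the factor reversal test. 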
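
Both checks in this example can be reproduced with a short Python sketch. It is illustrative only: the helpers `swap_time` and `swap_factors` are assumptions introduced here to interchange the two years and to interchange prices with quantities. The indices are kept as ratios, that is, without the factor 100.

```python
import math

# (p0, p1, q0, q1) for commodities A, B and C of this example
items = [(20, 24, 24, 30), (14, 10, 30, 40), (10, 18, 10, 16)]

def laspeyres(items):
    """Laspeyres price index as a ratio: sum p1*q0 / sum p0*q0."""
    return (sum(p1 * q0 for p0, p1, q0, q1 in items)
            / sum(p0 * q0 for p0, p1, q0, q1 in items))

def paasche(items):
    """Paasche price index as a ratio: sum p1*q1 / sum p0*q1."""
    return (sum(p1 * q1 for p0, p1, q0, q1 in items)
            / sum(p0 * q1 for p0, p1, q0, q1 in items))

def fisher(items):
    """Fisher's ideal index as a ratio."""
    return math.sqrt(laspeyres(items) * paasche(items))

def swap_time(items):
    """Interchange base and current year (used for the backward index P10)."""
    return [(p1, p0, q1, q0) for p0, p1, q0, q1 in items]

def swap_factors(items):
    """Interchange prices and quantities (used for the quantity index Q01)."""
    return [(q0, q1, p0, p1) for p0, p1, q0, q1 in items]

p01 = fisher(items)
p10 = fisher(swap_time(items))
q01 = fisher(swap_factors(items))
value_index = (sum(p1 * q1 for p0, p1, q0, q1 in items)
               / sum(p0 * q0 for p0, p1, q0, q1 in items))

print(round(p01 * 100, 2))                   # Fisher's price index
print(p01 * p10)                             # time reversal: should be 1
print(p01 * q01, value_index)                # factor reversal: should agree
print(laspeyres(items) * laspeyres(swap_time(items)))  # Laspeyres: not 1
```

The last line shows why Laspeyre’s formula fails the time reversal test for these data: the forward and backward indices do not multiply to 1.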


14.2.6 CHAIN BASE INDEX NUMBER 


In this method, there is no fixed base period; the year 
immediately preceding the one for which the price index has to be 
calculated is taken as the base year. Thus, for the year 1994 the base 
year would be 1993, for 1993 it would be 1992, for 1992 it would be 
1991, and so on. In this way the base is not fixed and keeps on 
changing. 

The chief advantage of this method is that the price relatives of a 
year can be compared with the price levels of the immediately preceding 
year. Businesses that are mostly interested in comparing the current period 
with the immediately preceding one, rather than with the distant past, 
will use this method. 

$$\text{Link relative of current year} = \frac{\text{Price in the current year}}{\text{Price in the preceding year}} \times 100$$

$$P_{n-1,n} = \frac{p_n}{p_{n-1}} \times 100$$

The chain index of a year is obtained by multiplying its link relative 
by the chain index of the preceding year and dividing by 100. 

Example: 

Find the chain base index numbers for the following data taking 2004 as 
the base year. 

Year  | 2004 | 2005 | 2006 | 2007 | 2008 | 2009 
Price | 18   | 21   | 25   | 23   | 28   | 30 

Solution: 

Year | Price | Link relative $\frac{p_n}{p_{n-1}} \times 100$ | Chain index 
2004 | 18 | $\frac{18}{18} \times 100 = 100$    | 100 
2005 | 21 | $\frac{21}{18} \times 100 = 116.67$ | $\frac{100 \times 116.67}{100} = 116.67$ 
2006 | 25 | $\frac{25}{21} \times 100 = 119.05$ | $\frac{116.67 \times 119.05}{100} = 138.90$ 
2007 | 23 | $\frac{23}{25} \times 100 = 92$     | $\frac{138.90 \times 92}{100} = 127.79$ 
2008 | 28 | $\frac{28}{23} \times 100 = 121.74$ | $\frac{127.79 \times 121.74}{100} = 155.57$ 
2009 | 30 | $\frac{30}{28} \times 100 = 107.14$ | $\frac{155.57 \times 107.14}{100} = 166.68$ 
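
The link relatives and chain indices of this example can be sketched in Python as follows. The function names are assumptions made for illustration. Because the code carries full precision from year to year, its chain indices can differ in the second decimal place from a hand calculation that rounds at every step (for instance 166.67 rather than 166.68 for 2009).

```python
prices = {2004: 18, 2005: 21, 2006: 25, 2007: 23, 2008: 28, 2009: 30}

def link_relatives(prices):
    """Price of each year as a percentage of the preceding year's price."""
    years = sorted(prices)
    links = {years[0]: 100.0}  # the first year is its own base
    for previous, current in zip(years, years[1:]):
        links[current] = prices[current] / prices[previous] * 100
    return links

def chain_indices(links):
    """Chain index = link relative x previous chain index / 100."""
    chain = {}
    previous_chain = 100.0
    for year in sorted(links):
        previous_chain = links[year] * previous_chain / 100
        chain[year] = previous_chain
    return chain

links = link_relatives(prices)
chain = chain_indices(links)
for year in sorted(prices):
    print(year, prices[year], round(links[year], 2), round(chain[year], 2))
```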
CHECK YOUR PROGRESS - 1 


1. What is a chain base index number? 
2. What is the formula for Fisher’s Ideal Index Number? 
3. What is a weighted index number? 


14.3 COST OF LIVING INDEX NUMBER 


Cost of living index numbers measure the changes in the prices 
paid by consumers for a special “basket” of goods and services during 
the current year as compared to the base year. The basket of goods and 
services will contain items like (1) Food (2) Rent (3) Clothing (4) Fuel 
and Lighting (5) Education (6) Miscellaneous like cleaning, transport, 
newspapers, etc. Cost of living index numbers are also called consumer 
price index numbers or retail price index numbers. 


14.3.1 CONSTRUCTION OF COST OF LIVING INDEX 
NUMBERS 


The following steps are involved in the construction of Cost of living 
index numbers. 


(1) Class of People: 


The first step in the construction of the Cost of living index (CLI) 
is that the class of people should be defined clearly. It should be decided 
whether the cost of living index number is being prepared for industrial 
workers, or middle or lower class salaried people living in a particular 
area. It is therefore necessary to specify the class of people and locality 
where they reside. 


(2) Family Budget Inquiry: 


The next step in the construction of a Cost of living index number 
is that some families should be selected randomly. These families 
provide information about the cost of food, clothing, rent, miscellaneous, 
etc. The inquiry includes questions on family size, income, the quality 
and quantity of resources consumed and the money spent on them, and 
the weights are assigned in proportions to the expenditure on different 
items. 


(3) Price Data: 


The next step is to collect data on the retail prices of the selected 
commodities for the current period and the base period. These prices 
should be obtained from the shops situated in the locality for which the 
index numbers are prepared. 


(4) Selection of Commodities: 

The next step is the selection of the commodities to be included. 
We should select those commodities which are most often used by that 
class of people. 


14.3.2 METHODS TO COMPUTE COST OF LIVING INDEX 
NUMBERS 

There are two methods to compute cost of living index numbers: 
(1) Aggregate Expenditure Method (2) Family Budget Method 


Aggregate Expenditure Method 

In this method, the quantities of commodities consumed by the 
particular group in the base year are estimated and these figures or their 
proportions are used as weights. Then the total expenditure of each 
commodity for each year is calculated. The price of the current year is 
multiplied by the quantity or weight of the base year. These products are 
added. Similarly, for the base year the total expenditure of each 
commodity is calculated by multiplying the quantity consumed by its 
price in the base year. These products are also added. The total 
expenditure of the current year is divided by the total expenditure of the 
base year and the resulting figure is multiplied by 100 to get the required 
index numbers. In this method, the current period quantities are not used 
as weights because these quantities change from year to year. 


$$P_{01} = \frac{\sum p_1 q_0}{\sum p_0 q_0} \times 100$$

Here, 

$p_1$ represents the price of the current year, 

$p_0$ represents the price of the base year and 

$q_0$ represents the quantity consumed in the base year. 


Family Budget Method: 


In this method, the family budgets of a large number of people are 
carefully studied and the aggregate expenditure of the average family for 
various items is estimated. These values are used as weights. The current 
year’s prices are converted into price relatives on the basis of the base 
year’s prices, and these price relatives are multiplied by the respective 
values of the commodities in the base year. The total of these products is 
divided by the sum of the weights and the resulting figure is the required 
index number. 

$$P_{01} = \frac{\sum W I}{\sum W}$$

Here, $I = \frac{p_1}{p_0} \times 100$ and $W = p_0 q_0$. 


Example: 

Construct the cost of living index number for 2018 on the basis of 
2017 from the following data using (1) Aggregate Expenditure Method 
(2) Family Budget Method. 

Commodity | Quantity consumed in 2017 (in quintals) | Price in 2017 | Price in 2018 
A         | 6 | 315.75  | 316.00 
B         | 6 | 305.00  | 308.00 
C         | 1 | 416.00  | 419.00 
D         | 6 | 528.00  | 610.00 
E         | 4 | 120.00  | 119.50 
F         | 1 | 1020.00 | 1015.00 
Solution: 

The cost of living index number of 2018 by the Aggregate Expenditure 
Method: 

Commodity | Quantity consumed in 2017 (in quintals) $q_0$ | Price in 2017 $p_0$ | Price in 2018 $p_1$ | $p_1 q_0$ | $p_0 q_0$ 
A         | 6 | 315.75  | 316.00  | 1896.00 | 1894.50 
B         | 6 | 305.00  | 308.00  | 1848.00 | 1830.00 
C         | 1 | 416.00  | 419.00  | 419.00  | 416.00 
D         | 6 | 528.00  | 610.00  | 3660.00 | 3168.00 
E         | 4 | 120.00  | 119.50  | 478.00  | 480.00 
F         | 1 | 1020.00 | 1015.00 | 1015.00 | 1020.00 
Total     |   |         |         | $\sum p_1 q_0 = 9316.00$ | $\sum p_0 q_0 = 8808.50$ 

The cost of living index number of 2018 is 

$$P_{01} = \frac{\sum p_1 q_0}{\sum p_0 q_0} \times 100 = \frac{9316}{8808.5} \times 100 = 105.76$$


The cost of living index number of 2018 by the Family Budget Method: 

Commodity | Quantity consumed in 2017 (in quintals) $q_0$ | Price in 2017 $p_0$ | Price in 2018 $p_1$ | $W = p_0 q_0$ | $I = \frac{p_1}{p_0} \times 100$ | Product $WI$ 
A         | 6 | 315.75  | 316.00  | 1894.5 | 100.08 | 189601.56 
B         | 6 | 305.00  | 308.00  | 1830.0 | 100.98 | 184793.40 
C         | 1 | 416.00  | 419.00  | 416.0  | 100.72 | 41899.52 
D         | 6 | 528.00  | 610.00  | 3168.0 | 115.53 | 365999.04 
E         | 4 | 120.00  | 119.50  | 480.0  | 99.58  | 47798.40 
F         | 1 | 1020.00 | 1015.00 | 1020.0 | 99.51  | 101500.20 
Total     |   |         |         | $\sum W = 8808.5$ | | $\sum WI = 931592.12$ 

The cost of living index number of 2018 is 

$$P_{01} = \frac{\sum W I}{\sum W} = \frac{931592.12}{8808.5} = 105.76$$
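
Both methods of this example can be sketched in Python. The function names are assumptions introduced for illustration. Since the family budget method with weights $W = p_0 q_0$ is algebraically the same as the aggregate expenditure method, the two functions should return the same figure.

```python
# commodity -> (quantity consumed in 2017 q0, price in 2017 p0, price in 2018 p1)
budget = {
    "A": (6, 315.75, 316.00),
    "B": (6, 305.00, 308.00),
    "C": (1, 416.00, 419.00),
    "D": (6, 528.00, 610.00),
    "E": (4, 120.00, 119.50),
    "F": (1, 1020.00, 1015.00),
}

def aggregate_expenditure_index(budget):
    """Current-year cost of the base-year basket over its base-year cost, times 100."""
    current_cost = sum(p1 * q0 for q0, p0, p1 in budget.values())
    base_cost = sum(p0 * q0 for q0, p0, p1 in budget.values())
    return current_cost / base_cost * 100

def family_budget_index(budget):
    """Weighted average of price relatives with weights W = p0 * q0."""
    weights = [p0 * q0 for q0, p0, p1 in budget.values()]
    relatives = [p1 / p0 * 100 for q0, p0, p1 in budget.values()]
    weighted_sum = sum(w * i for w, i in zip(weights, relatives))
    return weighted_sum / sum(weights)

aggregate_index = aggregate_expenditure_index(budget)
family_index = family_budget_index(budget)
print(round(aggregate_index, 2), round(family_index, 2))
```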


14.3.3 USES OF COST OF LIVING INDEX NUMBER 


• They indicate the changes in consumer prices. Thus they help the 
  government in formulating policies regarding control of prices, 
  taxation, imports and exports of commodities, etc. 
• They are used in granting allowances and other facilities to 
  employees. 
• They are used for the evaluation of the purchasing power of money. 
• They are used for deflating money. 
• They are used for comparing changes in the cost of living of 
  different classes of people. 


14.4 USES OF INDEX NUMBER 


The main uses of index numbers are given below. 


• Index numbers are used in the fields of commerce, meteorology, 
  labour, industry, etc. 
• Index numbers measure fluctuations during intervals of time, 
  group differences of geographical position of degree, etc. 
• They are used to compare the total variations in the prices of 
  different commodities in which the unit of measurement differs 
  with time and price, etc. 
• They measure the purchasing power of money. 
• They are helpful in forecasting future economic trends. 
• They are used in studying the difference between the comparable 
  categories of animals, people or items. 
• Index numbers of industrial production are used to measure the 
  changes in the level of industrial production in the country. 
• Index numbers of import prices and export prices are used to 
  measure the changes in the trade of a country. 
• Index numbers are used to measure seasonal variations and cyclical 
  variations in a time series. 


14.5 LIMITATIONS OF INDEX NUMBER 


• They are simply rough indications of the relative changes. 
• The choice of representative commodities may lead to fallacious 
  conclusions as they are based on samples. 
• There may be errors in the choice of base periods or weights, etc. 
• Comparisons of changes in variables over long periods are not 
  reliable. 
• They may be useful for one purpose but not for another. 
• They are specialized types of averages and hence are subject to all 
  those limitations which an average suffers from. 

CHECK YOUR PROGRESS - 2 


4. What are the methods to compute Cost of Living Index 
numbers? 
5. What are the popular tests for index numbers? 
6. Write a few uses of index numbers. 


14.6 SUMMARY 


• Index numbers are meant to study changes in the 
  effects of factors which cannot be measured directly. 
  According to Bowley, “Index numbers are used to 
  measure the changes in some quantity which we cannot 
  observe directly”. 

• Price index numbers, quantity index numbers and value index 
  numbers are the types of index numbers. 

• Price index numbers measure and permit comparison of 
  the prices of certain goods; quantity index numbers, on the 
  other hand, measure the physical volume of production, 
  construction or employment. Though price indices are 
  more widely used, production indices are highly 
  significant as indicators of the level of output in the 
  economy or in parts of it. 

• There are certain tests which are applied to verify the 
  consistency, or adequacy, of an index number formula 
  from different points of view. The most popular among 
  these are the following tests: (1) Order reversal test (2) 
  Time reversal test (3) Factor reversal test (4) Unit test. 

• In the chain base method, there is no fixed base period; the year 
  immediately preceding the one for which the price index 
  has to be calculated is taken as the base year. 

• Cost of living index numbers measure the changes in the 
  prices paid by consumers for a special “basket” of goods 
  and services during the current year as compared to the 
  base year. 

• There are two methods to compute cost of living index 
  numbers: (1) Aggregate Expenditure Method (2) Family 
  Budget Method. 


14.7 KEY WORDS 


Index numbers, Price Index, Quantity Index, Value Index, Laspeyre’s 
Index Number, Paasche’s Index Number, Fisher’s Ideal Index 
Number, Marshall-Edgeworth Index Number, Order reversal test, 
Time reversal test, Factor reversal test, Unit test, Chain base index 
number, Cost of living index number. 


14.8 ANSWERS TO CHECK YOUR PROGRESS 


1. In the chain base method, there is no fixed base period; the year 
immediately preceding the one for which the price index has to be 
calculated is taken as the base year. 

2. $P_{01} = \sqrt{\text{Laspeyre’s Index} \times \text{Paasche’s Index}} = \sqrt{\frac{\sum p_1 q_0}{\sum p_0 q_0} \times \frac{\sum p_1 q_1}{\sum p_0 q_1}} \times 100$ 

3. When all commodities are not of equal importance, we assign a weight to 
each commodity relative to its importance, and the index number 
computed from these weights is called a weighted index number. 

4. There are two methods to compute cost of living index numbers: (1) 
Aggregate Expenditure Method (2) Family Budget Method. 

5. Order reversal test, Time reversal test, Factor reversal test, Unit test. 

6. Index numbers are used in the fields of commerce, meteorology, 
labour, industry, etc. Index numbers measure fluctuations during intervals 
of time, group differences of geographical position of degree, etc. They 
are used to compare the total variations in the prices of different 
commodities in which the unit of measurement differs with time and 
price, etc. They measure the purchasing power of money. They are helpful 
in forecasting future economic trends. 


14.9 QUESTIONS AND EXERCISE 


SHORT ANSWER QUESTIONS 


1. Define index number and write the uses of index numbers 
2. State the types of index numbers 
3. State the methods of constructing consumer price index 


LONG ANSWER QUESTIONS 


1. Compute (1) Laspeyre’s (2) Paasche’s index numbers for 2010 
from the following data 

Commodity | Price 2002 | Price 2010 | Quantity 2002 | Quantity 2010 
W         | 4          | 6          | 8             | 7 
X         | 3          | 5          | 10            | 8 
Y         | 2          | 4          | 14            | 12 
Z         | 5          | 7          | 19            | 11 


2. Calculate Fisher’s ideal index number for the following data 

Commodity | 2011 Price | 2011 Quantity | 2012 Price | 2012 Quantity 
A         | 2          | 7             | 3          | 5 
B         | 5          | 11            | 6          | 10 
C         | 3          | 14            | 5          | 11 
D         | 4          | 16            | 4          | 18 


3. Construct the consumer price index number of 2015 on the basis of 2014 
from the following data using 
(i) the aggregate expenditure method and 
(ii) the family budget method 

Commodity | Quantity consumed in 2014 | Price in 2014 | Price in 2015 
A         | 6 Kg      | 5 | 7 
B         | 6 Quintal | 6 | 6 
C         | 5 Quintal | 5 | 4 
D         | 6 Quintal | 7 | 7 
E         | 4 Quintal | 8 | 8 
F         | 5 Kg      | 9 | 9 

DISTANCE EDUCATION - CBCS - (2018 - 2019 Academic Year Onwards) 
Question Paper Pattern - BUSINESS STATISTICS 
(UG Programs) 
Time: 3 Hours Maximum: 75 Marks 
Part - A (10 x 2 = 20 Marks) 
Answer all questions 


1. What is chi-square test? 
2. What is meant by analysis of variance? 
3. Write short note on index number. 
4. Describe Type II error 
5. Write any four advantages of statistics 
6. Explain the term Probable Error 
7. What is meant by Binomial Distribution? 
8. Define the term “Correlation” 
9. What is meant by Regression? 
10. What is forecasting? 


Part - B (5 x 5 = 25 Marks) 
Answer all questions choosing either (A) or (B) 


11. (A) What is the importance of statistics? 


(or) 
(B) Calculate the median from the following data 
Profit : 100-200 200-300 300-400 400-500 500-600 600-700 700-800 
No. of Shops: 10 18 20 26 30 28 18 
12. (A) Find the rank correlation. 
X : 1 2 3 4 5 
Y : 5 4 3 4 1 
(or) 


(B) What is the probability that a leap year will contain 53 Sundays? 


13. (A) The manufacturer of a certain make of electric bulbs claims that the bulbs have a mean life of 25 
months with a standard deviation of 5 months. A random sample of 6 such 
bulbs gave the following values. 


Life in months: 24 26 30 20 20 18 
Can you regard the producer’s claim to be valid at 1% level of significance? V = 4.032 


(or) 


(B). Find standard deviation from the following observations. 
Size : 120 125 130 135 140 145 150 155 160 
Frequencies :2 3 3 1 2 7 4 2 8 


14. (A) The first of two samples consists of 23 pairs and gives a correlation coefficient of 0.5 while 
the second of 28 pairs has the correlation coefficient of 0.8. Are these values significantly 
different? 


(or) 


(B) Explain in brief the assumptions of ANOVA. 


15. (A) Calculate the 3 yearly and 5 yearly moving averages of the data 

Years | 1   | 2   | 3   | 4   | 5   | 6   | 7   | 8   | 9   | 10   | 11  | 12 
Sales | 5.2 | 4.9 | 5.5 | 4.9 | 5.2 | 5.7 | 5.4 | 5.8 | 5.9 | 6.00 | 5.2 | 4.8 


(or) 
(B) Calculate trend values by the method of least squares from the data given below and estimate the 
sales for 2003 

Years                    | 1996 | 1997 | 1998 | 1999 | 2000 
Sales of Co. A (₹ Lakhs) | 70   | 74   | 80   | 86   | 90 


Part - C (3 x 10 = 30 Marks) 
Answer any three out of five questions 
16. Eight coins are tossed simultaneously 256 times. The number of heads observed at each throw is 
recorded as given below: 


No. of Heads: 0 1 2 3 4 5 6 7 8 
Frequencies : 2 6 30 52 67 56 32 10 1 


Fit a binomial distribution and find the expected frequencies. Also find the mean and standard 
deviation 


17. The quantity of raw materials purchased by a company at the specified prices during the 12 
months of 2017 is given as follows: 

Months           | J   | F   | M   | A   | M   | J   | JY  | A   | S   | O   | N   | DEC 
Price in Rs      | 96  | 110 | 100 | 90  | 86  | 92  | 112 | 112 | 108 | 116 | 86  | 92 
Quantity (in kg) | 250 | 200 | 250 | 280 | 300 | 300 | 220 | 220 | 200 | 210 | 300 | 250 

a. Find the regression equations based on the above data. 

b. Can you estimate the approximate quantity likely to be purchased if the price shoots up to 
Rs.124 per kg? 


18. 200 digits are chosen at random from a set of tables. The frequencies of the digits are as follows 

Digit : 0 1 2 3 4 5 6 7 8 9 

F : 18 19 23 21 16 25 22 20 21 15 

Use Chi-Square test to assess the correctness of the hypothesis that the digits were distributed in 
equal numbers in the tables from which they were chosen. V, 9 = 16.22 


19. The sales data of an item in six shops before and after a special promotional campaign are as under: 
Before Campaign: 53 28 31 48 50 42 
After Campaign : 58 29 30 55 56 45 


Can the campaign be judged to be a success? Test at 5% level of significance. V = t0.05 = 2.57 


20. Construct index numbers by Laspeyre’s and Paasche’s methods. 

Types | Price 2012 | Price 2014 | Quantity 2012 | Quantity 2014 
X     | 9          | 15         | 5             | 5 
Y     | 4          | 12         | 10            | 11 
Z     | 1          | 5          | 6             | 6 


